Minireview
Connecting the dots between genes, biochemistry, and disease susceptibility: systems biology modeling in human genetics

https://doi.org/10.1016/j.ymgme.2004.10.006Get rights and content

Abstract

Understanding how DNA sequence variations impact human health through a hierarchy of biochemical and physiological systems is expected to improve the diagnosis, prevention, and treatment of common, complex human diseases. We have previously developed a hierarchical dynamic systems approach based on Petri nets for generating biochemical network models that are consistent with genetic models of disease susceptibility. This modeling approach uses an evolutionary computation approach called grammatical evolution as a search strategy for optimal Petri net models. We have previously demonstrated that this approach routinely identifies biochemical network models that are consistent with a variety of genetic models in which disease susceptibility is determined by nonlinear interactions between two or more DNA sequence variations. We review here this approach and then discuss how it can be used to model biochemical and metabolic data in the context of genetic studies of human disease susceptibility.

Introduction

Moore [26] has proposed that epistasis or gene–gene interactions are a ubiquitous component of the genetic architecture of common human diseases. There are several lines of evidence that suggest epistasis is likely to play a significant role in determining susceptibility to common human diseases. First, the idea that epistasis is important is not new [2]. More importantly, it is an idea that has prevailed over the last 100 years. Second, the ubiquity of biomolecular interactions in gene regulation and biochemical and metabolic systems suggest that relationship between DNA sequence variations and clinical endpoints is likely to involve gene–gene interactions. Third, positive results from studies of single polymorphisms typically do not replicate across independent samples. This is true for both linkage and association studies [1], [15]. Moore and Williams [35] have suggested that the reason single-locus results do not replicate is because epistasis is more important. Fourth, gene–gene interactions are commonly found when properly investigated [46]. When powerful computational and statistical approaches such as multifactor dimensionality reduction [13], [14], [42], [43] and set association analysis [16], [48] are applied to genetic association studies, evidence for epistasis is often detected for clinical endpoints such as sporadic breast cancer [42], coronary artery restenosis [50], essential hypertension [49], atrial fibrillation [47], and type II diabetes [4]. Collectively, these observations suggest that epistasis is common and should be taken into consideration along with gene–environment interactions in genetic studies of common diseases.

While the detection of epistasis is a significant challenge, it has been suggested that the interpretation of epistasis models is even more difficult [34]. This is because the number of genotype combinations goes up exponentially for each additional locus in the model. Consider, for example, the four-locus epistasis model for sporadic breast cancer described by Ritchie et al. [42]. This model consists of four single nucleotide polymorphisms (SNPs) in three estrogen metabolism genes with a total of 81 genotype combinations. Connecting the high-risk and low-risk genotype combinations with what is known about the estrogen metabolism pathway is extremely difficult since much of what is know about the function of this pathway is based on experiments that involve one gene at a time. If we are to successfully use genetic information at the public health level to improve the diagnosis, prevention, and treatment of common diseases we will need to understand how DNA sequence information influences human health through hierarchical networks of biochemical, metabolic, and physiological systems. The goal of this paper is to review recent developments in the use of a discrete dynamic systems modeling tool called Petri nets (PNs) for developing biochemical and metabolic systems models that are consistent with genetic models of disease susceptibility. We begin with a simple example of epistasis in the form of a penetrance function. We then provide a general introduction to PNs, an overview of a PN modeling strategy, and recent results using the approach. We end with some ideas about how this modeling tool might be useful for the study of biochemical systems data in the context of human disease susceptibility and genetics.

Table 1, Table 2 illustrate two examples of epistasis models in the form of penetrance functions that specify the probability of disease given a particular genotype or genotype combination. When allele frequencies are equal and genotypes are consistent with Hardy–Weinberg proportions these models exhibit multilocus genetic effects in the absence of any independent main effects The first model was initially described by Li and Reich [23] and later by Moore et al. [32]. This model is based on the nonlinear XOR function that generates an interaction effect in which high risk of disease is dependent on inheriting a heterozygous genotype (Aa) from one locus or a heterozygous genotype (Bb) from a second locus, but not both. The XOR function is commonly used in data mining examples because it generates a pattern that is not linearly separable. The high-risk genotype combinations are AaBB, Aabb, AABb, and aaBb, all with penetrances of 0.1 (see Table 1). The second model was initially described by Frankel and Schork [10] and later by Moore et al. [32]. In this model, high risk of disease is dependent on inheriting exactly two high-risk alleles (A and/or B) from two different loci. For this model, the high-risk genotype combinations are AAbb, AaBb, and aaBB with penetrances of 0.1, 0.05, and 0.1, respectively (see Table 2). These models are important because they illustrate it is possible to have interaction effects in the absence of any detectable independent main effect. Moore et al. [32], [33] have shown that it is possible to generate thousands of such models consisting of two–five loci. It is not currently known whether such models are consistent with any particular human disease. However, these models are helpful for the development and evaluation of new analytical strategies for detecting interactions since they represent perhaps worst-case scenarios in which no single locus has a detectable effect.

Petri nets (PNs) are a type of directed graph that can be used to model discrete dynamical systems [5]. Goss and Peccoud [12] demonstrated that PNs could be used to model molecular interactions in biochemical systems. The core PN consists of two different types of nodes: places and transitions. Using the biochemical systems analogy of Goss and Peccoud [12], places represent molecular species. Each place has a certain number of tokens that represent the number of molecules for that particular molecular specie. A transition is analogous to a molecular or chemical reaction and is said to fire when it acquires tokens from a source place and, after a possible delay, deposits tokens in a destination place. Tokens travel from a place to a transition or from a transition to a place via arcs with specific weights or bandwidths that are analogous to stoichiometric coefficients. While the number of tokens transferred from place to transition to place is determined by the arc weights, the rate at which the tokens are transferred is determined by the delay associated with the transition. Transition behavior is also constrained by the weights of the source and destination arcs. A transition will only fire if two preconditions are met: (1) if the source place can completely use the capacity of the source arc and (2) if the destination place has the capacity available to store the number of tokens provided by the full weight of the destination arc. Transitions without an input arc, act as if they are connected to a limitless supply of tokens. Similarly, transitions without an output arc can consume a limitless supply of tokens. The transition firing rate can be immediate, delayed deterministically or stochastically depending on the complexity needed. The fundamental behavior of a PN can be controlled by varying the maximum number of tokens a place can hold, the weight of each arc, and the firing rates of the transitions. Fig. 1 summarizes the basic PN elements and their biochemical analogies.

Murata [36] reviews several different types of PN. For example, hierarchical PNs allow a given PN to be a node in a second PN. Stochastic PN have transitions that fire according to probability distributions while timed PNs have transitions that fire at deterministic times. There are also hybrid and colored PNs that have additional modeling flexibility such as allowing places that can be continuous instead of discrete.

Moore and Hahn [27] developed a strategy for identifying PN models of biochemical systems that are consistent with genetic models of disease susceptibility. The specific PNs used to model the biochemical pathways are timed PNs [25], [39]. Transitions had either a fixed delay or fired as soon as the preconditions of the transition were met. If a place provided input to two or more transitions but had only enough tokens to satisfy one transition, then the transition with the shortest delay fired. If a place provided input to two or more transitions and had enough tokens to satisfy more than one transition, then the timers associated with both transitions began to count down. When the timers had counted down to 0, the transition fired unless two transitions were simultaneously ready to fire in which case one of the transitions is chosen to fire and the other transition(s) reset.

The goal of identifying PN models of biochemical systems that are consistent with genetic models of disease susceptibility is accomplished by developing PNs that are dependent on specific genotypes from two or more genetic variations. Moore and Hahn [27] make firing rates of transitions and/or arc weights genotype-dependent yielding different PN behavior. Each PN model is related to the genetic model using a discrete version of the threshold model from population genetics [7]. With a classic threshold or liability model, it is the concentration of a biochemical or environmental substance that is related to the risk of disease, under the hypothesis that risk of disease is greatly increased once a particular substance exceeds some threshold concentration. Conversely, the risk of disease may increase in the absence of a particular factor or with any significant deviation from a reference level. In such cases, high or low levels are associated with high risk while an intermediate level is associated with low risk. Moore and Hahn [27] use a discrete version of this model for deterministic PNs. For each model, the number of tokens at a particular place is recorded and if they exceed a certain threshold, the appropriate-risk assignment is made. If the number of tokens does not exceed the threshold, the alternative-risk assignment is made. The high-risk and low-risk assignments made by the discrete threshold from the output of the PN can then be compared to the high-risk and low-risk genotypes from the genetic model. A perfect match indicates the PN model is consistent with the genetic model. The PN then becomes a model that relates the genetic variations to risk of disease through an intermediate biochemical network. This general strategy is illustrated in Fig. 2.

Identifying PN models that are consistent with the genotype-dependent distribution of risk is challenging by hand. Therefore, Moore and Hahn [27] developed an evolutionary computing approach to the discovery of Petri net models. This approach is described in the next section.

Evolutionary computation arose from early work on evolutionary programming [8], [9] and evolution strategies [40], [44] that used simulated evolution for artificial intelligence. The focus on representations at the genotypic level lead to the development of genetic algorithms by Holland [17], [18] and others. Genetic algorithms have become a popular machine intelligence strategy because they can be effective for implementing parallel searches of rugged fitness landscapes [11]. Briefly, this is accomplished by generating a random population of models or solutions, evaluating their ability to solve the problem at hand, selecting the best models or solutions, and generating variability in these models by exchanging model components among different models. The process of selecting models and introducing variability is iterated until an optimal model is identified or some termination criteria are satisfied. A limitation of genetic algorithms is that models or solutions are represented by linear arrays of bits. In response to this limitation, Koza [21] developed a more flexible evolutionary computation strategy called genetic programming where the models or solutions are represented by binary expression trees. Koza et al. [22] and Kitagawa and Iba [20] have applied genetic programming to modeling metabolic networks.

Grammatical evolution has been described by O’Neill and Ryan [37] as a variation on genetic programming. Here, a Backus–Naur Form (BNF) grammar is specified that allows a computer program or model to be constructed by a simple genetic algorithm operating on an array of bits. The ability to specify a grammar is appealing because only a text file specifying the grammar needs to be altered for different applications. There is no need to modify and recompile source code during development once the fitness function is specified. The end result is a decrease in development time and an increase in computational flexibility. It is for this reason that Moore and Hahn [27] selected grammatical evolution instead of genetic programming as the evolutionary computation method for the discovery of PN models.

Moore and Hahn [27] developed a grammar for PNs in BNF. Backus–Naur Form is a formal notation for describing the syntax of a context-free grammar as a set of production rules that consist of terminals and nonterminals [24]. Nonterminals form the left-hand side of production rules while both terminals and nonterminals can form the right-hand side. A terminal is essentially a model element while a nonterminal is the name of a production rule. For the Petri net models, the terminal set includes, for example, the basic building blocks of a Petri net: places, arcs, and transitions. The nonterminal set includes the names of production rules that construct the Petri net. For example, a nonterminal might name a production rule for determining whether an arc has weights that are fixed or genotype-dependent. We show below in (1) the production rule that was executed to begin the model building process for the study by Moore and Hahn [27]. This root production rule is larger than that used by Moore and Hahn [27] for modeling penetrance functions consisting of two DNA sequence variations. Here, there is an extra DNA sequence variation that needs to be included. We have also added an additional transition and an additional arc to the root model.root::=pick_a_genepick_a_genenet_iterationsexprtransitionplace_noarc

When the initial 〈root〉 production rule is executed, a single Petri net place with no entering or exiting arc (i.e., 〈place_noarc〉) is selected and a transition leading into or out of that place is selected. The arc connecting the transition and place can be dependent on the genotypes of the genes selected by 〈pick_a_gene〉. The nonterminal 〈expr〉 is a function that allows the Petri net to grow. The production rule for 〈expr〉 is shown below in (2). Here, the selection of one of the four nonterminals (0, 1, 2, or 3) in the right-hand side of the production rule is determined by a combination of bits in the genetic algorithm chromosome.expr::=exprexpr0|arc1|transition2|place3

The base or minimum PN that is constructed using the 〈root〉 production rule consists of a single place, a single transition, and an arc that connects the transition to the place. Multiple calls to the production rule 〈expr〉 by the genetic algorithm chromosome can build any connected PN. In addition, the number of times the PN is to be iterated is selected with the nonterminal 〈net_iterations〉. Many other production rules control the arc weights, the genotype-dependent arcs and transitions, the number of initial tokens in a place, the place capacity, etc. All decisions made in the building of the PN model are made by each subsequent bit or combination of bits in the genetic algorithm chromosome. The complete grammar is too large for detailed presentation here.

Once a Petri net model is constructed using the BNF grammar, as instructed by the genetic algorithm chromosome, the model fitness is determined. As described by Moore and Hahn [27], [28], [29], this is carried out by executing the Petri net model for each combination of genotypes in the genetic dataset and comparing the final token counts at a defined place to a threshold constant to determine the risk assignment. Let G be the set of i = 1–n possible genotype combinations where n = 27, for example, when there are three genetic variations, each with three genotypes. Let Zi be the final number of tokens from the designated Petri net place for the ith genotype combination and let c be the threshold constant. Let d (Gi) be the risk assignment for the ith genotype combination in the genetic model and let f (Gi) be the risk assignment made by the Petri net. If Zi  c then f (Gi) = “high-risk” else if Zi < c then f (Gi) = “low-risk.” The dichotomous risk assignment is consistent with epidemiological study designs in which subjects with the disease (cases) and subjects without the disease (controls) are used to identify genetic-risk factors. Genotypes that are more common in cases than controls can be thought of as high risk. Fitness (E) of the Petri net model is determined by comparing the high-risk and low-risk assignments made by the Petri net to those from the given nonlinear gene–gene interaction model. Calculation of the fitness value, E, is given by the classification error function in (3). In the present study, max (E) = 27 and min (E) = 0. The goal is to minimize E.E=i=1|G|ei,whereei=0iff(Gi)=d(Gi)ei=1iff(Gi)(Gi)

Moore and Hahn [27], [29] demonstrated that the combined PN and grammatical evolution strategy was capable of routinely generating biochemical systems models that are consistent with two-locus epistasis models. Fig. 2 illustrates a model that was identified by Moore and Hahn [27] that is consistent with the two-locus epistasis model illustrated in Table 2. This PN model consists of one place or molecular specie with a transition leading to the place and one leading away from the place. Here, the first transition (T0) and the two arcs (A0, A1) are dependent on genotypes from the two SNPs (G0, G1). For example, the transition fires every four time steps for the AA genotype, every two time steps for the Aa genotype, and every time step for the aa genotype. Thus, this single locus exhibits some deviation due to dominance. This same locus impacts the second arc with some evidence of heterosis. That is, the AA and aa genotypes have higher arc diameters than the heterozygote. This locus also exhibits pleiotropy because it impacts more than one part of the network. Collectively, these two SNPs interact in a nonadditive manner through this biochemical network.

Moore and Hahn [28] extended the PN analysis to three-locus epistasis models in the absence of main effects. However, this analysis did not routinely generate models with perfect fits. In fact, only one perfect model was identified out of 100 runs. The reduction in performance was due to an increase in the number of multilocus genotypes from 9 to 27. Thus, the complexity of the model fitting task was significantly higher. Moore and Hahn [30], [31] demonstrated that modification of the grammar significantly increased performance. The grammar was modified in several ways. First, a grammar was evaluated that required at least one or two PN elements to be dependent on genotypes from each of the loci in the model. Second, a grammar that uniformly generates PN models from the search space was implemented. Both of these new grammars routinely yielded perfect PN models. The uniform grammar had the best performance with perfect models discovered about 1/3 of the time. It should be noted that improving the grammar was far more effective than changing genetic algorithm parameters such as population size or number of generations.

There are several general systems biology strategies that can be utilized to study the genetics of susceptibility to common human diseases. One strategy relies on the collection and integrated analysis of genetic, genomic, and proteomic data from complex biological systems [41]. A second strategy relies on perturbation of complex biological systems in model organisms [19]. A third strategy relies on computer simulations for carrying out thought experiments that can be used to generate testable hypotheses [6]. The PN studies reviewed here illustrate that it is possible to carry out thought experiments about biochemical and metabolic systems models that are consistent with disease susceptibility models. While these models most likely do not represent real biochemical or metabolic systems, they may be useful for thinking about general mechanisms that might have a biological basis. For example, it is important to note that many of the PN models developed by Moore and Hahn [27], [29] for two-locus epistasis models are architecturally simple even though they are functionally complex. The PN illustrated in Fig. 2 and described above consists of a single molecular specie and two transitions. This observation could be used to generate hypotheses about biochemical systems that could then be tested using perturbation experiments in model organisms. We discuss in the next section some ideas about how this strategy might be useful for the analysis of real biochemical systems data in the context of human disease.

Summar et al. [45] have developed the concept of environmentally determined gene expression or EDGE as a framework for the study of gene–gene and gene–environment interactions in human disease. The EDGE framework is based on the idea that (1) genetically encoded differences in expressed proteins react differently under increasingly extreme environmental conditions, (2) the resulting phenotype or pathology is determined through a combination of the functional magnitude of the genetic change and the severity of the environmental stimulus, (3) rare genetic disorders represent one extreme, resulting in a phenotype, while extreme environmental conditions represent the other, and (4) the majority of diseases lie between these two extremes. Summar et al. [45] illustrate the EDGE concept using polymorphisms and mutations in genes that play an important role in the rate-limiting part of the urea cycle.

Carbamyl phosphate synthetase I (CPSI) determines the rate-limiting entry of free ammonia into the urea cycle. Disruption of CPSI affects the ability of the liver to remove waste nitrogen and produce arginine, citrulline, and urea. At one end of the EDGE spectrum is classic CPSI deficiency with corresponding hyperammonemia that is the result of primarily genetic causes in newborns. At the other end of the spectrum is suboptimal urea cycle production that greatly impacts morbidity and mortality following bone marrow transplantation [45]. Here, the phenotype is not observed unless both the predisposing genotypes and the environmental impact of bone marrow transplantation occur together.

The urea cycle is an ideal place to begin applying the PN modelling strategy for several reasons. First, polymorphisms and mutations in genes that encode enzymes from the urea cycle have been associated with both rare and common human diseases. Second, the enzymes and metabolites of the urea cycle are easily measured. Third, functional studies are possible in both in vitro and in vivo systems. Finally, Petri net models of the urea cycle have already been developed [3] thus providing a starting point for discrete dynamic systems modeling. We anticipate the urea cycle will be amenable to systems biology modeling in the context of human disease. Further, it represents one of the few biochemical systems where it will be possible to study the genotype–phenotype relationship with measured information about enzymes and metabolites.

How will PN be useful for modeling the urea cycle? As an example, consider the relationship between the urea cycle and neonatal pulmonary hypertension. Pearson and Summar [38] have shown that infants with persistent pulmonary hypertension have low plasma concentrations of arginine and nitric oxide metabolites. In this study, infants with pulmonary hypertension had a significantly lower mean plasma concentration of arginine than the control group with respiratory distress (20.2 ± 8.8 μmol/L versus 39.8 ± 17 μmol/L, respectively, P < 0.001). Further, infants with pulmonary hypertension had a significantly lower mean plasma concentration of nitric oxide metabolites than the control group (18.8 ± 12.7 μmol/L versus 47.2  ± 11.2 μmol/L, respectively, P = 0.05). Interestingly, Pearson and Summar [38] showed that the T1450N polymorphism in the CPSI gene was associated with the persistent pulmonary hypertension phenotype. In fact, the AA genotype was completely absent from the pulmonary hypertension group and was less frequent than expected in the entire study population of infants with some degree of respiratory distress.

Given a genetic model of pulmonary hypertension based on polymorphisms in genes from the urea cycle, the PN strategy reviewed here could be used to generate hypotheses about the relationship between arginine, nitric oxide metabolites, and other metabolites for different genotype combinations. These hypotheses could then be tested using observational data as described above or with experimental data as described by Summar et al. [45]. For example, preliminary work with the T1405N polymorphism suggests a mild effect on the in vitro kinetic activity of the CPSI enzyme while expression studies demonstrate lower catalytic activity when COS-7 cells are co-transfected with both the A and C forms of CPSI compared to A or C alone [45]. Studies using stable isotope tracers may shed light on the in vivo effects of these polymorphisms. We see the PN strategy reviewed here as playing an important role in generating models that can be used to guide these types of experimental studies. Once all the functional data has been collected, a detailed and accurate PN model of the urea cycle can then be constructed similar to the one developed by Chen and Hofstadt [3].

While we have provided an example where biochemical data can be modeled in the context of a genetic model of disease susceptibility, it is the promise of systems biology to provide biomolecular data at different levels of the genotype to phenotype hierarchy to more fully understand disease etiology. For the urea cycle example, this might include detailed descriptions of the transcriptional regulatory network and physiological systems such as blood pressure homeostasis that are impacted by the urea cycle. While we focused primarily on biochemical systems as one level in this hierarchy, the PN approach can certainly be extended to account for additional layers of complexity. For example, Murata [36] reviewed hierarchical PNs allow a given PN to be a node in a second PN. This variation on the PN approach would allow transcriptional regulatory networks to feed into metabolic systems that in turn feed into physiological systems. Thus, it is possible to extend the PN approach reviewed here to carry out though experiments about more complex biological hierarchies that might be consistent with a genetic mode of disease susceptibility.

Section snippets

Summary and conclusions

We have reviewed here a bioinformatics strategy for the automatic discovery of discrete dynamic systems model of biochemistry and metabolism that are consistent with genetic models of human disease susceptibility. This Petri net modeling approach makes it possible to carry out thought experiments in the computer that may be useful for generating hypotheses that can be tested in vitro or in vivo. Ultimately, modeling strategies such as this one will be useful for systems biology modeling of the

Acknowledgments

This work was supported by National Institutes of Health Grants HL65234, AI59694, HD047447, and ES09915. We thank Dr. Lance Hahn for his excellent technical assistance in the development, evaluation, and implementation of the Petri net algorithms and software.

References (50)

  • L.J. Fogel

    Autonomous automata

    Ind. Res.

    (1962)
  • L.J. Fogel et al.

    Artificial Intelligence Through Simulated Evolution

    (1966)
  • W.N. Frankel et al.

    Who’s afraid of epistasis?

    Nat. Genet.

    (1996)
  • D.E. Goldberg

    Genetic Algorithms in Search, Optimization, and Machine Learning

    (1989)
  • P.J. Goss et al.

    Quantitative modeling of stochastic systems in molecular biology by using stochastic Petri nets

    Proc. Nat. Acad. Sci. USA

    (1998)
  • L.W. Hahn et al.

    Ideal discrimination of discrete clinical endpoints using multilocus genotypes

    In Silico Biol.

    (2004)
  • L.W. Hahn et al.

    Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions

    Bioinformatics

    (2003)
  • J. Hoh et al.

    Trimming, weighting, and grouping SNPs in human case-control association studies

    Genome Res.

    (2001)
  • J.H. Holland, Adaptive plans optimal for payoff-only environments, in: Proceedings of the 2nd Hawaii International...
  • J.H. Holland

    Adaptation in Natural and Artificial Systems

    (1975)
  • R.C. Jansen

    Studying complex biological systems using multifactorial perturbation

    Nat. Rev. Genet.

    (2003)
  • J. Kitagawa et al.

    Identifying metabolic pathways and gene regulation networks with evolutionary algorithms

  • J.R. Koza

    Genetic programming: on the programming of computers by means of natural selection

    (1992)
  • J.R. Koza et al.

    Reverse engineering of metabolic pathways from observed data using genetic programming

    Pacific Symposium on Biocomputing

    (2001)
  • W. Li et al.

    A complete enumeration and classification of two-locus disease models

    Hum. Hered.

    (2000)
  • Cited by (27)

    • A 3-factor epistatic model predicts digital ulcers in Italian scleroderma patients

      2010, European Journal of Internal Medicine
      Citation Excerpt :

      Yet, whilst the candidate-gene approach we used may ignore a number of potentially important SNPs, on the other hand it greatly reduces the chances of reporting false-positive findings and of describing a spurious result [21]. The biological interpretation of statistical models of epistasis is a challenging task as what is found at the population level does not necessarily reflect what ultimately happens at the biological level [36]. To solve this issue we applied a discrete dynamic system tool, specifically Petri nets, to provide a plausible biological interpretation of the MDR model.

    • Epistasis and Its Implications for Personal Genetics

      2009, American Journal of Human Genetics
      Citation Excerpt :

      Therefore, our knowledge of the diversity of genetic models underlying common diseases is in its infancy. However, we can begin to think about biological plausibility via computational thought experiments.33 To this end, Moore and Hahn developed a computational system for discovering systems biology models that are consistent with epistasis models such as XOR.34

    View all citing articles on Scopus
    View full text