202
Views
0
CrossRef citations to date
0
Altmetric
Research Article

Modeling gene content across a phylogeny to determine when genes become associated

ORCID Icon, ORCID Icon &
Received 13 Dec 2022, Accepted 10 Mar 2024, Published online: 25 Mar 2024

Abstract

We consider a model for inferring functional links between genes. We begin with the simple case of two genes whose presence or absence evolves stochastically along a phylogenetic tree. We develop a hidden Markov model where the hidden states of the model correspond to whether or not the genes perform a joint function. In the case that two genes do perform a joint function, the rates of gain or loss of each gene depend on the presence or absence of the other gene. Otherwise, those two genes are assumed to be gained and lost independently. Using simulation, we investigate the conditions under which the package corHMM can infer the hidden state correctly, and we also investigate when the Akaike information criterion (AIC) has the power to reject the simpler model when it is incorrect. We find that we can more accurately determine the dependent and independent rate class regimes when the trees have more tips and when the differences in the two rate classes are larger. We find the accuracy of corHMM is not overly affected by whether or not there are multiple transitions between the rate classes or just a single transition. We show how the two-gene case can be extended to a more general n-gene model with a level-dependent quasi-birth-and-death (LD-QBD) framework. We assume that the level n of the QBD corresponds to the number of genes that are required for some beneficial function, and the phases within each level record the presence/absence of particular genes.

1. Introduction

In this work, we develop a stochastic model of gene gain and loss with the aim of inferring when (if at all), in evolutionary history, an association between two (or more) genes arises. The data we consider is a species tree along with information on the presence or absence of genes in each species. One biological motivation for our model is that if genes are involved in the same biochemical pathway and therefore they are both required for some function, then the rate of gain or loss of one gene in the pathway should depend upon the presence or absence of the other gene in the pathway[Citation12]. However, if genes are not functionally linked, then the rate of gain or loss of one gene should be independent of the state of the other genes[Citation24, Citation25]. Functional linkage could also occur if multiple genes code for subunits that form oligomeric proteins[Citation17].

We do not explicitly model the mechanism by which genes are gained, but the assumptions of our model are consistent with some of the ways new genes are thought to arise in nature. Genes can be gained via de novo gene birth, or by gene duplication followed by subsequent subfunctionalization or neofunctionalization (sometimes known as the patchwork model[Citation30]), or by horizontal gene transfer (HGT). Of these mechanisms, de novo gene gain seems like it would be highly unlikely to occur independently in different evolutionary lineages, but there is evidence for convergent gene gain via both HGT[Citation2, Citation15] and gene duplication followed by subsequent subfunctionalization or neofunctionalization[Citation11].

We are not the first to consider mathematical models of how genes gain and maintain interactions. Francis and Tanaka[Citation12] developed a model for the evolution of bacterial genetic pathways (where genes were gained via HGT) under the assumption that the complete pathway was required to confer any selective benefit. They considered the dynamics of such a system in a single species rather than looking at patterns of gene presence and absence over a phylogeny. Lynch proposed models[Citation17, Citation18] for the evolution of multimeric protein assemblages - his models focused on population genetic aspects such as mutation and drift rather than looking at a phylogenetic scale.

The model proposed in this article takes a more phylogenetic approach compared to the models mentioned above but could be applied in similar biological settings. It builds upon some existing models for the evolution of discrete traits. Barker et al.[Citation3] and Barker and Pagel[Citation4] were the first to note that the presence and absence of two genes can be modeled as a four-state Markov process. We assume that states {00,01,10,11} record each gene’s presence (1) or absence (0). Each transition can only have one gene loss or gain (e.g. 00→01 or 00→10 are permitted but not 00→11). To test if two genes were associated, Barker et al.[Citation3] and Barker and Pagel[Citation4] used likelihood ratio tests to compare a model where substitution rates were consistent with independent evolution (e.g., q0001=q1011) versus a model where rates were dependent. Marazzi et al.[Citation19] proposed a precursor model for the evolution of a binary trait. Instead of transiting directly from 0→1 or 1→0, they developed a model with an intermediate unobservable state. Boyko and Beaulieu[Citation7] further generalized this idea to models with hidden rate classes. For example, two different rate classes may have the same observable states but different transition rates among the states.

The following are our contributions in this article. First, in Section 2, we construct a two-gene model based on the dependent and independent models considered by Barker et al.[Citation3] as two hidden rate classes in the framework proposed by Boyko and Beaulieu[Citation7]. We simulate data under this two-gene model and determine under what conditions a shift from the independent rates class to the dependent rates class can be detected using a hidden Markov model as implemented in the R package corHMM from Boyko and Beaulieu[Citation7]. For example, how large a tree is required and how large a shift in the rates is needed before the Akaike information criterion (AIC) supports a model with two rate classes over a simpler model with just one rate class? If a model with two rate classes is preferred, can it correctly detect where on the evolutionary tree the shift (neofunctionaliszation) occurred?

Further, in Section 3, building on the theory of the quasi-birth-and-death models (QBDs) in matrix-analytic methods[Citation16, Citation21], we present an extension of our two-gene model to the case where an association arises between more than two genes. We construct a level-dependent QBD (LD-QBD) to model the processes by which the association between N−1 genes can be extended to N genes for some N > 2. In our model, we assume that the rates of gene gain or loss may depend on the number of genes that have become associated. We then present numerical examples to illustrate biological insights that may be gained from the analysis of the stationary distribution.

Finally, in Section 4, we discuss a wide range of possible applications and possible extensions of our model to the analysis of biological gene presence-absence data.

2. Neofunctionalization two-gene model I

First, we propose the following continuous-time Markov chain {X(t):t0} with state space (1) S={00,01,10,11,00~,01~,10~,11~},(1) where 1 and 0 correspond to the presence or absence of a gene, states in the subset Sind={00,01,10,11}S correspond to the case where the gain/loss of each gene does not depend on the presence/absence of the other gene, and states in the subset Sd={00~,01~,10~,11~}S correspond to genes that do depend on one another. For example, state X(t)=11~ means that both genes are present in the species, and are also dependent on one another, at time t.

The transition rates of this process are collected in the generator matrix A=[Ai,j]i,jS given by, (2) A=0001101100~01~10~11~0001101100~01~10~11~[q12q1300000q210q240000q310q3400000q42q43000λ\hlineq1000q12~q13~00q200q21~0q24~00q30q31~0q12~000q40q42~q43~],(2) where 0 entries correspond to events that may not occur; the rates of transitions from being dependent to being independent are set to q1,q2,q3,q4; the transition from independence to dependence may only occur from state 11 to state 11~, at some rate denoted λ; and the ondiagonals, marked with *, are given by Ai,i=jiAi,j, for all i,jS. An alternative assumption that could be made about the transition from an independent to a dependent state is that a change in selective pressure could occur regardless of whether both genes are present, in this case, the diagonal entries in the upper right block of matrix A would all be set to λ. This latter assumption might be more appropriate in the case where genes are gained via horizontal transfer, whereas the assumption we make here perhaps fits better with gene gain being the result of duplication followed by subsequent subfunctionalization or neofunctionalization[Citation30].

We assume that genes are gained or lost at some Poisson rates αk and βk, per gene k. Therefore, when in one of the ‘independent-genes’ states in the set 𝒮ind, we have α1=q13=q24 and α2=q12=q34, which are some Poisson rates at which species may gain gene 1 and gene 2, respectively; while β1=q31=q42 and β2=q21=q43, which are some Poisson rates at which species may lose gene 1 and gene 2, respectively.

For our simulation study, we make some further simplifying assumptions about the model in Equation2. We assume that both genes are gained/lost at the same rate. We further assume that the transition from dependence to independence occurs at a constant rate, denoted by q1=q2=q3=q4=γ.

Finally, we denote by r00,r01,r10,r11,r¯01,r¯10>0 the parameters which modify the base rates of transition between states within the set 𝒮d, so that given some state i𝒮d, gene k=1,2 is gained at rate αkri or lost at rate βkr¯i.

By the above assumptions, the generator matrix 𝐀 of the resulting process, referred to as Model I, is given by (3) A=0001101100~01~10~11~0001101100~01~10~11~[α2α100000β20α10000β10α200000β1β2000λγ000α2r00α1r0000γ00β2r010α1r¯0100γ0β1r100α2r¯10000γ0β1r11β2r11].(3)

In the mutation-selection framework, beneficial mutations are more likely to fix in a population, so the rate of substitution of a beneficial mutation should be higher[Citation13, Citation20, Citation22]. Similarly, deleterious mutations are less likely to be fixed in a population, so the rate of losing one of those genes should be lower[Citation8, Citation23, Citation32]. We consider that when two genes become associated, i.e., both are needed to perform some useful function, then a transition from state 11 to state 11~ occurs. Since the function is useful, we assume that the rate of losing a gene, when both genes are present (11~ moves to 10~ or 01~), would be substantially less than the rate of losing a gene when one of the genes is already absent (10~ moves to 00~, or 01~ moves to 00~). Therefore, r11≪1 and for similar reasons r¯01,r¯10>1. In contrast, we expect r00, r01, and r10 to be ≈1. The reason is that when one of the genes is absent, the useful function that the genes perform together has already been lost, so a further loss will not be protected by selection compared to the independent case. Here we enforce that these rates (r00, r01, and r10) are exactly equal to 1 as we assume that any other shifts in selective pressure between the dependent and independent regimes will be minor, furthermore, it is advantageous to reduce the number of parameters that corHMM needs to estimate.

2.1. Simulation study

For all simulations and example figures in this section, species trees were simulated under a birth-and-death process with constant rates using the sim.bdtree function from the R package geiger[Citation26]. The speciation rate was set to 1, and the extinction rate was 0. 7. Given a species tree, we then randomly generated the presence or absence of genes for each species on the tree, using two different transition rate classes (𝐐Ind and 𝐐Dep). In our first two sets of simulations, we picked a single node in the tree where a transition from the independent to the dependent regime occurred. The node was selected randomly, but with the constraint that the descendants of the node were no less than 25% of the total taxa and no more than 75%.

We considered two main measures of performance. The first was whether a Hidden Markov Model (HMM) following the structure given in Equation (3) gave a better fit than a standard Markov model with only one transition rate class. We compared it to a standard Markov model with one transition rate class that either had the same structure as 𝐐Ind or 𝐐Dep. We compared the AICc values between a two-rate classes HMM model and a standard Markov model following Anderson and Burnham’s[Citation1] interpretation of the difference in AICc between models, where a difference of 2 or more indicates that the model with the hidden rate classes is considered significantly better compared to a simpler model.

The second measure of performance is related to whether the transition points between the independent and dependent regimes can be identified. To infer the ancestral states, we first applied the HMM model in corHMM to estimate the transition rates in 𝐀. We then applied the function makeSimmap from the corHMM package to simulate the ancestral states along the tree, given the inferred transition rates. For our model, the function makeSimmap assigns one of 8 different states to each segment of the branches. However, we only take note of whether or not the state is in the independent/dependent rate class. Although the results of makeSimmap are stochastic, we found that multiple runs produced similar results, so we report only one “simmap” per simulation. We computed the percentage of matching branch length between the true generating model and the “simmap” based on the inferred model.

To illustrate the structure of the data we generate, in , we present an example of a 50-taxon species tree and the presence or absence of two genes. In the blue portion of the tree shown in , the two genes are associated and having them both is beneficial. This means that the rate of loss of either gene from a state with both genes is very low, in this example, all tip species in this portion of the tree have both genes. In the rest of the tree, the genes are not associated and we see all possible combinations of presence and absence.

Figure 1. A phylogeny with the presence or absence of two genes shown at the tips. Gene 1 (circles) and gene 2 (squares), are either absent (red) or present (black) in each of the species at the tips of the tree. Black edges correspond to the parts of the tree where the genes are independent of each other, while blue edges indicate the two genes are dependent.

Figure 1. A phylogeny with the presence or absence of two genes shown at the tips. Gene 1 (circles) and gene 2 (squares), are either absent (red) or present (black) in each of the species at the tips of the tree. Black edges correspond to the parts of the tree where the genes are independent of each other, while blue edges indicate the two genes are dependent.

In their analysis of the corHMM package, Boyko and Beaulieu[Citation7] found that datasets simulated with a mean rate of 0. 1 for transitions between hidden rate classes and between observable states gave more accurate results than simulations using higher mean rates. In keeping with this, our first simulation used the generator matrices 𝐐Ind, 𝐐Dep given by (4) QInd=[0.040.0400.0300.040.0300.0400.030.03],QDep=[0.040.0400.0300.40.0300.400.003750.00375].(4)

These matrices correspond to values of r¯01,r¯10=10 and r01 = 0. 125 in Equation (Equation3). We generated 100 species trees, each with 150, 200, or 250 tips. The results of the simulation are shown in .

Figure 2. Panel (a) shows the AICc difference between the HMM with two hidden rate classes and a standard Markov with the same structure as 𝐐Ind. Panel (b) shows the AICc difference between the HMM and a standard Markov model with the same structure as 𝐐Dep. Panel (c) shows the percentage of the edge length where the state of dependence or independence was correctly assigned when comparing between the true generating model and a “simmap” using the inferred transition rates in the HMM. Results are shown for 100 trees each of size of 150, 200 and 250. The true generating model used rates from Equation (Equation4) and had a single transition from independence to dependence.

Figure 2. Panel (a) shows the AICc difference between the HMM with two hidden rate classes and a standard Markov with the same structure as 𝐐Ind. Panel (b) shows the AICc difference between the HMM and a standard Markov model with the same structure as 𝐐Dep. Panel (c) shows the percentage of the edge length where the state of dependence or independence was correctly assigned when comparing between the true generating model and a “simmap” using the inferred transition rates in the HMM. Results are shown for 100 trees each of size of 150, 200 and 250. The true generating model used rates from Equation (Equation4(4) QInd=[∗0.040.0400.03∗00.040.030∗0.0400.030.03∗],QDep=[∗0.040.0400.03∗00.40.030∗0.400.003750.00375∗].(4) ) and had a single transition from independence to dependence.

We found that, while it was possible to correctly identify that a model with two hidden rate classes was superior to a model with a single rate class (Figure 2, panels a and b), it was not possible to accurately identify where in the tree the transition occurred (Figure 2, panel c).

In their study, Barker et al.[Citation3] investigated rates of gene gain or loss in the range [0. 1, 6] (per unit of branch length). They found that the best values for being able to identify an association between genes were in the range [0. 8, 1. 5]. In our second set of simulations, we used higher rates of gene gain and loss to investigate if it would lead to better performance in detecting where transitions in the hidden rate classes occur. We chose the rates of gene gain or loss as follows: (5) QInd=[330203203022],QDep=[330208h208h012h12h],(5) where h=1,2,3. In terms of Equation (Equation3) this corresponds to r¯01,r¯10=83h and r11=14h. Results relating to model selection are shown in . Similarly to our first set of simulations, the AICc differences in show that the two rate-class HMMs are usually significantly better than a standard Markov model.

Figure 3. The AICc difference between the HMM and the standard Markov models under different values of h. (a) shows the AICc difference between the HMM and a standard Markov, which has the same structure as 𝐐Ind, and (b) shows the AICc difference between the HMM and a standard Markov model, which has the same structure as 𝐐Dep The dashed horizontal line is the value equal to 2.

Figure 3. The AICc difference between the HMM and the standard Markov models under different values of h. (a) shows the AICc difference between the HMM and a standard Markov, which has the same structure as 𝐐Ind, and (b) shows the AICc difference between the HMM and a standard Markov model, which has the same structure as 𝐐Dep The dashed horizontal line is the value equal to 2.

We were particularly interested to see if larger rates of gene gain and loss would lead to better performance in identifying where on the tree transitions between the independent and dependent rate classes occurred. shows two contrasting results, in (a) and (b), there is a 94. 4% match between the true and inferred states with respect to dependence and independence, however, in (c) and (d) there is only a 42. 3% match. These two examples were both generated with the same choice of 𝐐Ind and 𝐐Dep and illustrate that there is a lot of variability in how well we are able to recover the hidden state.

Figure 4. Results of two simulation instances using the values of 𝐐Ind, 𝐐Dep in Equation (Equation5) with h = 3. Panels (a) and (c) are examples of the true generating model. Panels (b) and (d) are examples of a character history simulated under the HMM using “simmap”. Black edges correspond to the times when the genes are independent of each other (transition rates in 𝐐Ind), and blue edges indicate the dependence of the two genes (transition rates in 𝐐Dep). In panel (b) and (d) black edges indicate the inferred independence of the two genes, and the red lines correspond to the inferred dependence.

Figure 4. Results of two simulation instances using the values of 𝐐Ind, 𝐐Dep in Equation (Equation5(5) QInd=[∗3302∗0320∗3022∗],QDep=[∗3302∗08h20∗8h012h12h∗],(5) ) with h = 3. Panels (a) and (c) are examples of the true generating model. Panels (b) and (d) are examples of a character history simulated under the HMM using “simmap”. Black edges correspond to the times when the genes are independent of each other (transition rates in 𝐐Ind), and blue edges indicate the dependence of the two genes (transition rates in 𝐐Dep). In panel (b) and (d) black edges indicate the inferred independence of the two genes, and the red lines correspond to the inferred dependence.

indicates that bigger differences between QInd and QDep make it easier to accurately assign ancestral states. shows that when h = 3 it is possible to accurately infer which edges are in the dependent versus independent state, provided the number of taxa is more than 80. We note that more taxa might be required for different values of h.

Figure 5. The percentage of the edge-length where the state of dependence or independence was correctly assigned when comparing between the true generating model and a “simmap” using the inferred transition rates in the HMM. Results are shown for 100 trees of size 50. The true generating model used rates from Equation (Equation5) with h{1,2,3}. The red dashed line at 0. 5 is the expected match if we randomly assign the ancestral state (i.e., guessing).

Figure 5. The percentage of the edge-length where the state of dependence or independence was correctly assigned when comparing between the true generating model and a “simmap” using the inferred transition rates in the HMM. Results are shown for 100 trees of size 50. The true generating model used rates from Equation (Equation5(5) QInd=[∗3302∗0320∗3022∗],QDep=[∗3302∗08h20∗8h012h12h∗],(5) ) with h∈{1,2,3}. The red dashed line at 0. 5 is the expected match if we randomly assign the ancestral state (i.e., guessing).

Figure 6. The percentage of the edge length where the state of dependence or independence was correctly assigned when comparing between the true generating model and a “simmap” using the inferred transition rates in the HMM. Results are shown for 100 trees each of size of 20,50,80,110,140, and 170. The true generating model used rates from Equation (Equation5) with h = 3.

Figure 6. The percentage of the edge length where the state of dependence or independence was correctly assigned when comparing between the true generating model and a “simmap” using the inferred transition rates in the HMM. Results are shown for 100 trees each of size of 20,50,80,110,140, and 170. The true generating model used rates from Equation (Equation5(5) QInd=[∗3302∗0320∗3022∗],QDep=[∗3302∗08h20∗8h012h12h∗],(5) ) with h = 3.

In our final simulation, we also considered cases where there were multiple transitions from the independent rate class to the dependent rate class. Here we simulated 100 trees and chose a random number of nodes in the range of [3, 8] that transitioned from the independent rate class to the dependent rate class.

and show that in the case of multiple transitions in the hidden rate class, it is still possible to identify where in the tree these transitions occur with reasonable accuracy. The results have a larger variance and a lower median accuracy compared to the same rates and the same size of tree with only a one-node transition from the independent rate class to the dependent rate class, see .

Figure 7. One example of comparing the correct matching edges between the true generating model (a) and the fitting model under the HMM (b). The proportion of matching edges is 82. 9%.

Figure 7. One example of comparing the correct matching edges between the true generating model (a) and the fitting model under the HMM (b). The proportion of matching edges is 82. 9%.

Figure 8. The percentage of the edge-length where the state of dependence or independence was correctly assigned when comparing between the true generating model and a “simmap” using the inferred transition rates in the HMM. Results are shown for 100 trees of a size of 140. The true generating model used rates from Equation (Equation4). Each true generating tree has between [3, 8] internodes of interest that transition from the independent rate class to the dependent rate class.

Figure 8. The percentage of the edge-length where the state of dependence or independence was correctly assigned when comparing between the true generating model and a “simmap” using the inferred transition rates in the HMM. Results are shown for 100 trees of a size of 140. The true generating model used rates from Equation (Equation4(4) QInd=[∗0.040.0400.03∗00.040.030∗0.0400.030.03∗],QDep=[∗0.040.0400.03∗00.40.030∗0.400.003750.00375∗].(4) ). Each true generating tree has between [3, 8] internodes of interest that transition from the independent rate class to the dependent rate class.

Overall, the simulation study showed that it is possible to detect that a model with two rate classes is more appropriate based on AICc scores. However, to detect exactly where these transitions occur the rate of gene gain and loss has to be quite large—perhaps unrealistically so.

3. Neofunctionalization model II

We extend the idea of Model I in Section 2 to the following more general model of neofunctionalization, referred to as Model II. Given n potential genes that may interact with one another, let S(n)=Sind(n)Sd(n),Sind(n)={(0;i1,,in):ik{0,1},k=1,,n},Sd(n)={(1;i1,,in):ik{0,1},k=1,,n}, where ik=1,0 corresponds to the presence or absence of gene k, respectively; while (1;i1,,in) and (0;i1,,in) denotes the dependence and independence of the genes.

Assume as before that, given state s=(0;i1,,in), gene k=1,,n may be gained or lost at a Poisson rate αk and βk, respectively. On the other hand, given a state s=(1;i1,,in), gene k=1,,n may be gained or lost at a Poisson rate αkr¯s and βkrs, respectively.

Suppose that an additional gene may be added to the functional pathway at some Poisson rate δ > 0. Then, given the current state (m;i1,,in), a transition to some state (m;i1,,in,1) with in+1=1 occurs at rate δ, for m=0,1.

Then, to model the evolution of genes, we construct a level-dependent Quasi-Birth-and-Death process (LD-QBD) {(X(t),φ(t)):t0} with state space (6) S={(n,φ):n=0,1,;φS(n)},(6) in which the level n is the number of genes that may potentially interact in some functional pathway, and phase φ records the information about the presence or absence of the genes, as well as dependence or independence.

Furthermore, assuming that N is the maximum number of potential genes involved in an interaction, the generator of the process {(X(t),φ(t)):t0} is matrix Q=[Q[n,n]]n,n{2,3,,N}=[Q[2,2]Q[2,3]000Q[3,3]Q[3,4]00000Q[N,N]], with the following expressions for the block matrices in 𝐐.

For n=2,,N1, the block matrix Q[n,n+1]=[q(m;i1,,in)(m;i1,,in+1)] is such that when (m;i1,,in)=(1;1,,1) then q(m;i1,,in)(m;i1,,in,1)=δ and q(m;i1,,in)(m;i1,,in+1)=0 otherwise.

For 2,,N, the block matrix Q[n,n]=[q(m;i1,,in)(m;i1,,in)] is such that Q[n,n]=[Q[,][n,n]],{0,1,,n}=[Q[0,0][n,n]Q[0,1][n,n]00Q[1,0][n,n]Q[1,1][n,n]Q[1,2][n,n]0000Q[n1,n][n,n]Q[n,n][n,n]], where the block matrix Q[,][n,n]=[q(m;i1,,in)(m;i1,,in)]i1++in=;i1++in= collects all transition rates from states with ℓ present genes (i1++in=) to states with with ℓ present genes (i1++in=).

As a simple example, matrix 𝐀 in (Equation3) corresponds to n = 2, and so, when rearranged, is equivalent to matrix 𝐐[2, 2], given by Q[2,2]=[Q[,][2,2]],{0,1,2} with Q[0,0][2,2]=[0γ],Q[0,1][2,2]=[α2α10000α2r00α1r00], and Q[1,0][2,2]=[β20β100β2r010β1r10],Q[1,1][2,2]=[000000γ000γ0],Q[1,2][2,2]=[α10α200α1r¯010α2r¯10], and Q[2,1][2,2]=[β1β20000β1r11β2r11],Q[2,2][2,2]=[λγ].

That is, Q[2,2]=[Q[0,0][2,2]Q[0,1][2,2]0Q[1,0][2,2]Q[1,1][2,2]Q[1,2][2,2]0Q[2,1][2,2]Q[2,2][2,2]]=[0α2α10000γ00α2r00α1r0000β20000α10β10000α200β2r01γ000α1r¯010β1r100γ00α2r¯1000β1β200λ0000β1r11β2r11γ].

If the maximum number of potential genes involved in an interaction is N = 2, then we simply have (9) Q=Q[2,2],(9) however, if N > 2, then 𝐐 has the more general form, as in (3), and then 𝐐[2, 3] is an 8×16 matrix, with (10) Q[2,3]=[000δ],(10) since the only possible transition in such a matrix is from state (m;i1,,in)=(1;1,,1) to state (m;i1,,in+1)=(1;1,,1,1), which occurs at rate δ.

The expressions for various metrics of interest for such defined LD-QBD can then be obtained by applying results from the literature of matrix-analytic methods, including, e.g., Ramaswami[Citation28], Joyner and Fralix[Citation14], Phung-Duc et al.[Citation27], and Samuelson et al.[Citation31]. Therefore, the analysis of the evolution of gene content and their association across a phylogeny is possible with our model proposed here.

3.1. Model where N genes become associated

We note that the process will eventually be absorbed into the set of states with N associated genes. Therefore, we now consider the generator matrix 𝐐[N, N], which by 3, is given by (11) Q[N,N]=[Q,{0,1,,N}[N,N]]=[Q[0,0][N,N]Q[0,1][N,N]000Q[1,0][N,N]Q[1,1][N,N]Q[1,2][N,N]000Q[2,1][N,N]Q[2,2][N,N]Q[2,3][N,N]0000Q[N,N1][N,N]Q[N,N][N,N]].(11)

Note that while the structure of Equation (Equation11) looks almost identical to Equation (3), the matrices 𝐐[n, n] for n < N in Equation (3) correspond to non-conservative generators, with some row sums being strictly negative due to the possibility of transitions from level n to level (n+1). On the other hand, 𝐐[N, N] in Equation (Equation11) corresponds to a conservative generator with all row sums equal to zero. The process with the generator 𝐐[N, N] can be analyzed using the results from the literature on the LD-QBDs[Citation14, Citation27, Citation29]. The stationary distribution of this process, relevant to the analysis here, is considered in the next section.

Denote by αi,βi,i{1,2,,N} the rates at which the i-th gene is gained and lost, respectively. As discussed in Section 2, we also consider the parameters that affect the transition rates within the set 𝒮d, which we denote by r,r,rN,rN>0. Parameters r,r affect the rate of gene gain and loss when ℓ < N, and we expect that the values of r,r are close to 1. Parameters rN,rN affect the rate of gene gain and loss when ℓ = N, when all the genes are present and perform some useful function, and we expect rN < 1 and rN>1.

We now describe the blocks Q[,][N,N] in the generator 𝐐[N, N]. For ℓ = 0, we have, Q[0,0][N,N]=[0γ] and (12) Q[0,1][N,N]=[αNαN1α1000000αNr0αN1r0α1r0].(12)

For ℓ = 1, (13) Q[1,0][N,N]=[βN0β100βNr10β1r1],(13) (14) Q[1,1][N,N]=[Oγ],(14) where 𝜸 is a N×N diagonal matrix with diagonal values γ. Next, assuming the order of the rows in Q[1,2][N,N] according to (0,,0,1),(0,,0,1,0),,(1,0,,0) and columns in Q[1,2][N,N] according to (1,0,,0,1),(0,1,0,0,1,0),,(1,0,,0,1), we can write Q[1,2][N,N]=[α1α2αN10000000000αNαN2α100000αN0000αn10000000000α1r1α2r1αN1r10000000000αNr1αN2r1α1r1000000αNr1000αN1r1].

For =1,2,,N1, we notice that the number of possible states of i1++in= is that total number of combinations equal to (N), and so the number of possible states to transition to such that i1++in= where =±1, is equal to (N).

We apply similar methods to write blocks Q[,][N,N], {1,,+1} for =2,,N1. For ℓ = N, we have (16) Q[N,N1][N,N]=[β1β2βN00000β1rNβNrN],(16) (17) Q[N1,N][N,N]=[α10αN00α1rN0αNrN],(17) and (18) Q[N,N][N,N]=[λγ].(18)

Assume that the gene gain or loss for each gene is equally likely. The rates of gaining or losing a gene are equal to αi = α and βi = β for all i{1,2,,N}.

3.2. Numerical results

Define the stationary distribution vector π=π=0,1,,N, π=[π(;i1,i2,iN)]i1+i2++iN=, given by (19) π(;i1,i2,iN)=limtP((X(t),φ(t))=(;i1,i2,iN)),(19) where the limits exist and do not depend on the initial state because of the stability of the process.

In order to compute 𝝅, for =0,1,,N, we apply an iterative scheme in Baumann and Sandmann[Citation5].

  1. Compute RN1=Q[N1,N]Q[N,N]1;

  2. Iterate R1=Q[1,](Q[,]+RQ[+1,])1 for =N1,N2,,1;

  3. Compute a solution 𝝅0≠𝟎 of π0(Q[0,0]+R0Q[1,0])=0;

  4. Iterate π+1=πR for =0,1,,N1;

  5. Renormailse 𝝅 using π=π=0Nπ.

We present examples of stationary distribution when N = 4. We apply the value in Equation (Equation5) when h = 1 in .

Figure 9. The stationary distribution π with =0,1,,4. Here the γ = 1, is the transition rate from the dependent rate class to the independent rate class. In (a), λ = 4 is the transition rate from the independent rate class to the dependent rate class, and in (b), λ = 2.

Figure 9. The stationary distribution πℓ with ℓ=0,1,⋯,4. Here the γ = 1, is the transition rate from the dependent rate class to the independent rate class. In (a), λ = 4 is the transition rate from the independent rate class to the dependent rate class, and in (b), λ = 2.

Now we calculate the stationary distribution using the value in Equation (Equation4) in . Since the processes have smaller rates of gene gain and loss, the transition rates from the independent rate class to the dependent rate class and from the dependent rate class to the independent rate class are both smaller.

Figure 10. The stationary distribution π with =0,1,,4. Here the γ = 0. 025, is the transition rate from the dependent rate class to the independent rate class. In (a), λ = 0. 1 is the transition rate from the independent rate class to the dependent rate class, and in (b), λ = 0. 2.

Figure 10. The stationary distribution πℓ with ℓ=0,1,⋯,4. Here the γ = 0. 025, is the transition rate from the dependent rate class to the independent rate class. In (a), λ = 0. 1 is the transition rate from the independent rate class to the dependent rate class, and in (b), λ = 0. 2.

4. Discussion

In this article, we have developed a new approach for modeling neofunctionalization across an evolutionary tree. In previous work by Diao et al.[Citation9, Citation10] neofunctionalization has been modeled in terms of a single gene acquiring a new beneficial function, here, we instead think of it in terms of groups of two or more genes gaining an association, i.e., becoming linked in a biochemical pathway that performs some new function. We applied the package corHMM from Boyko and Beaulieu[Citation7] to generate gene presence-absence data under the model and then assessed how well we were able to detect if a HMM was the best model, and if so, where on the tree transitions between the dependent and independent rate classes occurred. While we were often able to reject simpler one-rate-class models in favor of the true HMM, we were not, in general, able to get accurate knowledge of where in the tree the transitions between rate classes occurred unless rates of gene gain and loss were high. We observed a large variance in the percentage of edges where we could correctly infer the hidden rate class.

The corHMM package allows users to specify which rates should be equal, but it does not allow us to impose additional constraints on whether some rates should be larger or smaller than others. In the biological scenario, we are modeling it makes sense that rates of loss from the 11 state should be lower in 𝐐Dep than 𝐐Ind. It is possible that an inference algorithm that was able to enforce this constraint would achieve higher levels of accuracy in determining where transitions between dependence and independence occur.

In our simulations with a single transition from independence to dependence, we found that using larger trees and larger mean rates of gain and loss had better accuracy at identifying the hidden rate classes along the edges, in the best cases, the median accuracy was as high as 95%. However, using a lower mean rate, which Boyko and Beaulieu[Citation7] mentioned as leading to better performance of corHMM, gave us poorer results. Additionally, we conducted some simulations where there were multiple transitions from the independent rate class to the dependent rate class. Those simulations led to a larger variance and less accuracy, but still had a median accuracy of around 90%.

To extend to the general case where N genes become associated, we developed a level-dependent quasi-birth-and-death (LD-QBD) model. The level, n, records the current number of genes that are potentially required to perform some joint function, and the phase records which genes are present or absent. In our model, all n genes must be present for selection acting on some useful function to offer protection (reduce the rates of gene loss). The genes act like components of a series circuit, where if any one component is missing, the beneficial function is lost. One possible extension of the work is to consider more flexible configurations where there is some redundancy.

Pan-genome studies are becoming increasingly prevalent, so in the future, there will be many more data sets where gene presence-absence information is available for numerous species, e.g., for crops and their wild relatives[Citation6]. For this reason, it will be important to continue to develop stochastic models that can advance our understanding of biological processes. The models presented here are a useful step in this direction.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

We would like to thank the Australian Research Council for funding this research through Discovery Project DP180100352. This work was partly funded by The Australian Research Council Centre of Excellence for Plant Success in Nature and Agriculture (CE200100015).

References

  • Anderson, D.;Burnham, K.; 2004, Model Selection and Multi-Model Inference, Vol. 63, 2nd ed. Springer-Verlag: New York, 2004; 10.
  • Aravind, L.;Tatusov, R.L.;Wolf, Y.I.;Walker, D.R.;Koonin, E.V. Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet. 1998, 14, 442–444. doi:10.1016/s0168-9525(98)01553-4. 9825671.
  • Barker, D.;Meade, A.;Pagel, M. Constrained models of evolution lead to improved prediction of functional linkage from correlated gain and loss of genes. Bioinformatics 2007, 23, 14–20. doi:10.1093/bioinformatics/btl558. 17090580.
  • Barker, D.;Pagel, M. Predicting functional gene links from phylogenetic-statistical analyses of whole genomes. PLoS Comput. Biol. 2005, 1, 3.
  • Baumann, H.;Sandmann, W. Numerical solution of level dependent quasi-birth-and-death processes. Procedia Comput. Sci. 2010, 1, 1561–1569. doi:10.1016/j.procs.2010.04.175.
  • Bohra, A.;Kilian, B.;Sivasankar, S.;Caccamo, M.;Mba, C.;Mccouch, S.R.;Varshney, R.K. Reap the crop wild relatives for breeding future crops. Trends Biotechnol. 2021, 40, 412–431.
  • Boyko, J.D.;Beaulieu, J.M. Generalized hidden Markov models for phylogenetic comparative datasets. Methods Ecol. Evol. 2021, 12, 468–478. doi:10.1111/2041-210X.13534.
  • Charlesworth, B.;Coyne, J.A.;Barton, N.H. The relative rates of evolution of sex chromosomes and autosomes. Amer. Natur. 1987, 130, 113–146. doi:10.1086/284701.
  • Diao, J.;M O’Reilly, M.;Holland, B. A subfunctionalisation model of gene family evolution predicts balanced tree shapes. Mol. Phylogenet. Evol. 2022, 176, 107566. doi:10.1016/j.ympev.2022.107566. PMC: 35810972.
  • Diao, J.;Stark, T.L.;Liberles, D.A.;O’Reilly, M.M.;Holland, B.R. Level-dependent QBD models for the evolution of a family of gene duplicates. Stoch. Models 2020, 36, 285–311. doi:10.1080/15326349.2019.1680296.
  • Fernández, R.;Gabaldón, T. Gene gain and loss across the metazoan tree of life. Nat. Ecol. Evol. 2020, 4, 524–533. doi:10.1038/s41559-019-1069-x. 31988444.
  • Francis, A.R.;Tanaka, M.. Evolution of variation in presence and absence of genes in bacterial pathways. BMC Evol. Biol. 2012, 12, 1–11.
  • Innan, H.;Kondrashov, F. The evolution of gene duplications: Classifying and distinguishing between models. Nat. Rev. Genet. 2010, 11, 97–108. doi:10.1038/nrg2689. 20051986.
  • Joyner, J.;Fralix, B. A new look at Markov processes of G / M /1-type. Stoch. Models 2016, 32, 253–274. doi:10.1080/15326349.2015.1115363.
  • Lake, J.A.;Rivera, M.C. Deriving the genomic tree of life in the presence of horizontal gene transfer: Conditioned reconstruction. Mol. Biol. Evol. 2004, 21, 681–690. doi:10.1093/molbev/msh061. 14739244.
  • Latouche, G.;Ramaswami, V. Introduction to matrix analytic methods in stochastic modeling. In ASA-SIAM Series on Statistics and Applied Mathematics. Society for Industrial and Applied Mathematics: Philadelphia, PA, 1999.
  • Lynch, M. The evolution of multimeric protein assemblages. Mol. Biol. Evol. 2012, 29, 1353–1366. doi:10.1093/molbev/msr300. 22144639.
  • Lynch, M. Evolutionary diversification of the multimeric states of proteins. Proc. Natl. Acad. Sci. 2013, 110, 2821–2828.
  • Marazzi, B.;Ané, C.;Simon, M.F.;Delgado-Salinas, A.;Luckow, M.;Sanderson, M.J. Locating evolutionary precursors on a phylogenetic tree. Evolution 2012, 66, 3918–3930. doi:10.1111/j.1558-5646.2012.01720.x. 23206146.
  • Monson, R.K. Gene duplication, neofunctionalization, and the evolution of c4 photosynthesis. Int. J. Plant Sci. 2003, 164, 43–54.
  • Neuts, M.F. Matrix-geometric solutions in stochastic models: An algorithmic approach. In Johns Hopkins Series in the Mathematical Sciences, Vol. 2. Johns Hopkins University Press: Baltimore, 1981.
  • Ohta, T. Simulating evolution by gene duplication. Genetics 1987, 115, 207–213. doi:10.1093/genetics/115.1.207. 3557113.
  • Orr, H. A. The population genetics of beneficial mutations. Philos. Trans. R Soc. Lond. B Biol. Sci. 2010, 365, 1195–1201. doi:10.1098/rstb.2009.0282. 20308094.
  • Pagel, M. Detecting correlated evolution on phylogenies: A general method for the comparative analysis of discrete characters. Proc. Royal Soc. London. Ser. B: Biological Sciences, 1342, 255, 37–45.
  • Pagel, M. Inferring the historical patterns of biological evolution. Nature 1999, 401, 877–884. doi:10.1038/44766. 10553904.
  • Pennell, M.W.;Eastman, J.M.;Slater, G.J.;Brown, J.W.;Uyeda, J.C.;FitzJohn, R.G.;Alfaro, M.E.;Harmon, L.J. Geiger v2.0: An expanded suite of methods for fitting macroevolutionary models to phylogenetic trees. Bioinformatics 2014, 30, 2216–2218. doi:10.1093/bioinformatics/btu181. 24728855.
  • Phung-Duc, T.;Masuyama, H.; Kasahara.; S.;Takahashi, Y. A simple algorithm for the rate matrices of level-dependent QBD processes. In Proceedings of the 5th International Conference on Queueing Theory and Network Applications: ACM: New York, USA, 2010; 46–52.
  • Ramaswami, V. Matrix-analytic methods: A tutorial overview with some extensions and new results. In Matrix-Analytic Methods in Stochastic Models; Chakravarthy, S., Alfa, A., Eds.; Marcel Dekker: New York, 1997; 261–296.
  • Ramaswami, V. Matrix analytic methods for stochastic fluid flows. In ITC16: International Teletraffic Congress: Edinburgh, Scotland, June 1999; 1019–1030.
  • Roth, C.;Rastogi, S.;Arvestad, L.;Dittmar, K.;Light, S.;Ekman, D.;Liberles, D.A. Evolution after gene duplication: Models, mechanisms, sequences, systems, and organisms. J. Exp. Zool. B Mol. Dev. Evol. 2007, 308, 58–73. doi:10.1002/jez.b.21124. 16838295.
  • Samuelson, A.;O’Reilly, M.;Bean, N. Construction of algorithms for discrete-time quasi-birth-and-death processes through physical interpretation. Stoch. Models 2020, 36, 193–222. doi:10.1080/15326349.2020.1744451.
  • Smith, J.M.;Haigh, J. The Hitch-Hiking effect of a favourable gene. Genet. Res. 1974, 23, 23–35. doi:10.1017/S0016672300014634.