Full article: On calculating the probability of a set of orthologous sequences

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

Probabilistic DNA sequence models have been intensively applied to genome research. Within the evolutionary biology framework, this article investigates the feasibility for rigorously estimating the probability of a set of orthologous DNA sequences which evolve from a common progenitor. We propose Monte Carlo integration algorithms to sample the unknown ancestral and/or root sequences a posteriori conditional on a reference sequence and apply pairwise Needleman–Wunsch alignment between the sampled and nonreference species sequences to estimate the probability. We test our algorithms on both simulated and real sequences and compare calculated probabilities from Monte Carlo integration to those induced by single multiple alignment.

Keywords:

Introduction

Comparative genomics/proteomics research often focuses on a set of orthologous sequences arising from evolutionary speciation. For example, multiple related species (for example, human, mouse, and rat) can have a common gene as well as the corresponding promoters in the upstream region of such a gene, although these matched sequences may have minor difference across species. For simplicity the set of sequences studied in the sequel are assumed to have almost equal length in light of these examples. Sequence alignment algorithms^Citation1 have substantially facilitated comparative genomics/proteomics research by showing conservation pattern along orthologous sequences, and biologically functional segments are likely to be those more conserved regions along the genome. For the vast body of related literature, we refer to Liu and colleagues,^Citation2 Kellis and colleagues,^Citation3 Moses and colleagues,^Citation4 Xie and colleagues,^Citation5 Wei and Jensen,^Citation6 Sinha and He,^Citation7 and many others. As another major tool, statistical modeling approaches are devoted to comprehensively describing the probabilistic uncertainties linked to those established biological evolution models which may include two topological structures: parallel and phylogenic models (see ).

Figure 1 Evolution models.

The joint parallel evolution process probability Pr(Ancestor, Species 1, 2, and 3) is

Pr (Ancestor) \prod_{i = 1}^{3} Pr (Species i ∣ Ancestor),

(1)

and the joint phylogenic evolution process probability Pr(Root, Ancestor, Species 1, 2, 3) is

Pr (Root) Pr (Species 1 ∣ Root) Pr (Ancestor ∣ Root) \prod_{i = 2}^{3} Pr (Species i ∣ Ancestor) .

(2)

Jukes and Cantor^Citation8 proposed the first probabilistic nucleotide evolution model which assumes substitution to take place randomly among four types of nucleotides “A[1]T[2]C[3]G[4]”. The transition (from nucleotide i to nucleotide j) probability up to time t is derived as

\begin{array}{l} p_{i j} (t) = (1 / 4) + (3 / 4) e^{- 4 α t} for (i = j), and \\ (1 / 4) - (1 / 4) e^{- 4 α t} for (i \neq j) . \end{array}

(3)

We assume that the substitution rate parameter (α) is constant for different species and the evolution duration (t) is represented by specific time period (t₀, t₁, t₂, or t₃) for the associated divergence process (see ). Our question is how to effectively estimate the marginal probability for the given orthologous species sequence set without knowing the genotype of the ancestor and/or root.

Material and methods

For the unknown ancestor sequence, we simply assume that the nucleotide on any site follows a tetranomial distribution with categories {ATCG} and equal proportion, (1/4). We further assume that each nucleotide on the ancestor sequence evolves independently (under the probability law, EquationEq. (3) $\begin{array}{l} p_{i j} (t) = (1 / 4) + (3 / 4) e^{- 4 α t} for (i = j), and \\ (1 / 4) - (1 / 4) e^{- 4 α t} for (i \neq j) . \end{array}$ (3) ), so that each species sequence is a series of nucleotides which follow another tetranomial distribution identically and independently. The state space is {ATCG} and the state proportions are (P_A, P_T, P_C, P_G) which can be calculated by

p_{j} = \sum_{i = 1}^{4} (1 / 4) p_{i j} (t) = 1 / 4, i, j \in {1234} = {A T C G} .

(4)

Thus, each species’ nucleotide follows the same tetranomial distribution as the ancestor nucleotide. Under the independence assumption, the probability for the species sequence is simply a product of all nucleotide marginal probabilities, (1/4). This formulation can also be used to sample the unknown ancestor state among {ATCG} given the reference species state j = (A, T, C or G), since the posterior distribution among {ATCG} for the unknown ancestor state can be easily derived to be

p (i ∣ j) = p_{i j} (t) / \sum_{k = 1}^{4} p_{k j} (t), i = 1, 2, 3, 4.

(5)

We now briefly investigate the ambiguity extent to which different sources of sequence are aligned. For simplicity, we use the Jukes and Cantor^Citation8 model and assume the ancestor vs species nucleotide identity (“ancestor = species”) probability is

1 - 3 p,

(6)

which equals p_ii (t), in EquationEq. (3) $\begin{array}{l} p_{i j} (t) = (1 / 4) + (3 / 4) e^{- 4 α t} for (i = j), and \\ (1 / 4) - (1 / 4) e^{- 4 α t} for (i \neq j) . \end{array}$ (3) , the substitution probability is p, which equals p_ij (t) in EquationEq. (3) $\begin{array}{l} p_{i j} (t) = (1 / 4) + (3 / 4) e^{- 4 α t} for (i = j), and \\ (1 / 4) - (1 / 4) e^{- 4 α t} for (i \neq j) . \end{array}$ (3) for i ≠ j. The identity probability between two species (“species = species”) nucleotides with equal evolution duration is thus

{(1 - 3 p)}^{2} + 3 p^{2} .

(7)

The statistical sequence evolution model works on probabilistic transition from the ancestor nucleotide to species nucleotide. Since the ancestor sequence is never known for a direct alignment, we may sample it a posteriori given the reference species nucleotide. The probability for event “X”, nucleotide identity between such a posterior ancestor nucleotide and another species’ nucleotide (“posterior ancestor = species”) other than the reference species nucleotide, is derived as

\begin{array}{l} Pr (reference species = another species) \\ \times Pr (X ∣ reference species = another species) \\ + Pr (reference species \neq another species) \\ \times Pr (X ∣ reference species \neq another species) \\ = ({(1 - 3 p)}^{2} + 3 p^{2}) (1 - 3 p) + (1 - {(1 - 3 p)}^{2} - 3 p^{2}) p \\ = - 48 p^{3} + 36 p^{2} - 9 p + 1. \end{array}

(8)

The three identity probabilities (EquationEqs. (6) $1 - 3 p,$ (6) , Equation(7) ${(1 - 3 p)}^{2} + 3 p^{2} .$ (7) , and Equation(8) $\begin{array}{l} Pr (reference species = another species) \\ \times Pr (X ∣ reference species = another species) \\ + Pr (reference species \neq another species) \\ \times Pr (X ∣ reference species \neq another species) \\ = ({(1 - 3 p)}^{2} + 3 p^{2}) (1 - 3 p) + (1 - {(1 - 3 p)}^{2} - 3 p^{2}) p \\ = - 48 p^{3} + 36 p^{2} - 9 p + 1. \end{array}$ (8) ) are plotted in where the ancestor-to-species transition probability (p) varies. We find that, the identity probabilities for these three types of matched nucleotides follow the order

Figure 2 Identity probabilities for matched nucleotide pair (“ancestor vs species,” “species vs species” and “posterior ancestor vs species”).

\begin{array}{l} Pr (ancestor = species) > Pr (species = species) \\ > Pr (posterior ancestor = species), \end{array}

(6) > (7) > (8) .

Thus, alignment between the ancestor and species sequences may be less ambiguous than between two species sequences. also indicates that, the difference among these three types of identity probability is most significant in the middle of interval (0, 1/4). The dominance of “Pr(ancestor = species) over Pr (reference species = another species)” is more significant than that of “Pr (reference species = another species) over Pr (posterior ancestor = another species)”, and the difference between the lower two curves () seems not to be relatively large. Thus pairwise alignment between the posterior ancestor and another species sequence may achieve similar unambiguity as alignment between species.

Now we study a multispecies orthologous sequence set (say Human = Species 1, Mouse = Species 2, and Rat = Species 3 in ). We denote by B_a, B₁, B₂, and B₃ the sequences of the unknown ancestor, Species 1, 2, and 3. Under the nucleotide substitution model and unambiguous matching, the probability for the set of sequences under parallel evolution is

\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = \sum_{B_{a}} P (B_{1}, B_{2}, B_{3} ∣ B_{a}) π (B_{a}) \\ = \sum_{B_{a}} P (B_{1} ∣ B_{a}) P (B_{2} ∣ B_{a}) \\ \times P (B_{3} ∣ B_{a}) π (B_{a}), \end{array}

(9)

where π (·) is the identical and independent tetranomial distribution for the ancestor nucleotide with state space {ATCG} and equal (1/4) proportion. The result is obtained by integrating out four possible ancestor nucleotides on each site for a marginal nucleotide group (three members across species) probability and multiplying these individual marginal probabilities along the sequence. Similarly, the phylogenic evolution model requires integrating out both the ancestor and root nucleotides on each site to get the result. Note that multiple alignment is not needed under the substitution model since no gaps are allowed. For general nucleotide substitution–insertion–deletion, the probabilistic evolution model developed by Rivas^Citation9 gives the overall “substitution, insertion, and deletion” probabilities from the ancestor to species given divergence time. Calculating the evolution probability from the ancestor (assumed to be known) to the observed species sequence using the Rivas^Citation9 model may require multiple alignments up front in order to match those nucleotides between the ancestor and species. Aside from not knowing the ancestor sequence, unambiguous alignment may not exist due to moderate sequence divergence.^Citation7 Thus, one can underestimate the sequence set probability which is induced in a similar way to EquationEq. (9) $\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = \sum_{B_{a}} P (B_{1}, B_{2}, B_{3} ∣ B_{a}) π (B_{a}) \\ = \sum_{B_{a}} P (B_{1} ∣ B_{a}) P (B_{2} ∣ B_{a}) \\ \times P (B_{3} ∣ B_{a}) π (B_{a}), \end{array}$ (9) because it simply picks one alignment to calculate the sequence set probability without incorporating other possible alignments. Ignoring ambiguous alignment may also lead to incorrect phylogenic inference and/or misleading sequence taxa partition pattern.^{Citation10,Citation11} Under a moderate sequence length (~100 nucleotides), a tetranomial distribution for each nucleotide along the ancestor sequence may be used to sample ancestor nucleotides independently to form a large number (n) of sequences which are further used to induce a probabilistically evolved set of species sequences based on the Rivas^Citation9 model. The sequence set probability is simply estimated as the number of exact duplicates of the given sequence set divided by n. However, this is highly impractical under moderate sequence length due to the small chance of duplicate sequence sets. Another possible way is applying pairwise alignment between each sampled ancestor sequence and observed species-specific sequences, and the sequence set probability may be done by averaging these evolution probabilities over all sampled ancestor sequences.

However, this may also be inefficient due to non-informative ancestor sampling and lack of reliable alignment between a random sequence and species sequences. Thus it becomes desirable to propose and investigate more efficient multiple-imputation-like approaches such as using posterior ancestor samples which may offer multiple representative alignment results conditional on a reference species sequences for sequence set probability elicitation. Instead of following the theme of EquationEq. (9) $\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = \sum_{B_{a}} P (B_{1}, B_{2}, B_{3} ∣ B_{a}) π (B_{a}) \\ = \sum_{B_{a}} P (B_{1} ∣ B_{a}) P (B_{2} ∣ B_{a}) \\ \times P (B_{3} ∣ B_{a}) π (B_{a}), \end{array}$ (9) , we turn to calculating P (B₁, B₂, B₃) in an alternative way under parallel evolution

\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = P (B_{2}, B_{3} ∣ B_{1}) P (B_{1}) \\ = [\sum_{B_{a}} P (B_{2}, B_{3} ∣ B_{a}) P (B_{a} ∣ B_{1})] \\ \times [\sum_{B_{a}} P (B_{1} ∣ B_{a}) π (B_{a})] \\ = [\sum_{B_{a}} P (B_{2} ∣ B_{a}) P (B_{3} ∣ B_{a}) P (B_{a} ∣ B_{1})] \\ [\sum_{B_{a}} P (B_{1} ∣ B_{a}) π (B_{a})] \\ = [I] \times [I I], \end{array}

(10)

where [II] is obtained after multiplicity over all nucleotide marginal probabilities for Species 1 (see EquationEq. (4) $p_{j} = \sum_{i = 1}^{4} (1 / 4) p_{i j} (t) = 1 / 4, i, j \in {1234} = {A T C G} .$ (4) ). As for [I], since B_a (posterior ancestor) is sampled from the reference sequence (offspring) B₁ and the integrand is the offspring (B₂, B₃) probability derived from the representative ancestor B_a which is already linked to offspring B₁ through posterior sampling. Monte Carlo integration introduced in EquationEq. (10) $\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = P (B_{2}, B_{3} ∣ B_{1}) P (B_{1}) \\ = [\sum_{B_{a}} P (B_{2}, B_{3} ∣ B_{a}) P (B_{a} ∣ B_{1})] \\ \times [\sum_{B_{a}} P (B_{1} ∣ B_{a}) π (B_{a})] \\ = [\sum_{B_{a}} P (B_{2} ∣ B_{a}) P (B_{3} ∣ B_{a}) P (B_{a} ∣ B_{1})] \\ [\sum_{B_{a}} P (B_{1} ∣ B_{a}) π (B_{a})] \\ = [I] \times [I I], \end{array}$ (10) realistically implements the joint probability of multiple post-evolution sequences by working on pairwise alignments between the sampled ancestor sequence and observed species-specific sequences. Under phylogenetic (tree-structured) evolution (the right panel in ), the sequence set probability can be written as

\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = P (B_{2}, B_{3} ∣ B_{1}) P (B_{1}) \\ = [\sum_{B_{r}, B_{a}} P (B_{2}, B_{3} ∣ B_{r}, B_{a}) P (B_{r}, B_{a} ∣ B_{1})] \\ \times [\sum_{B_{r}} P (B_{1} ∣ B_{r}) π (B_{r})] \\ = [\sum_{B_{r}, B_{a}} P (B_{2}, B_{3} ∣ B_{r}, B_{a}) P (B_{a} ∣ B_{r}) P (B_{r} ∣ B_{1})] \\ \times [\sum_{B_{r}} P (B_{1} ∣ B_{r}) π (B_{r})] \\ = [\sum_{B_{r}, B_{a}} P (B_{2}, B_{3} ∣ B_{a}) P (B_{a} ∣ B_{r}) P (B_{r} ∣ B_{1})] \\ \times [\sum_{B_{r}} P (B_{1} ∣ B_{r}) π (B_{r})] \\ = [\sum_{B_{r}, B_{a}} P (B_{2} ∣ B_{a}) P (B_{3} ∣ B_{a}) P (B_{a} ∣ B_{r}) P (B_{r} ∣ B_{1})] \\ \times [\sum_{B_{r}} P (B_{1} ∣ B_{r}) π (B_{r})] \\ = [I] \times [I I], \end{array}

(11)

where B₁ is Species 1 (Human) sequence, B_r is the root sequence, and B_a is the ancestor sequence in . However, if we use Species 2 (Mouse) sequence B₂ as the reference sequence, then we have the following decomposition

\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = P (B_{1}, B_{3} ∣ B_{2}) P (B_{2}) \\ = [\sum_{B_{r}, B_{a}} P (B_{1}, B_{3} ∣ B_{r}, B_{a}) P (B_{r}, B_{a} ∣ B_{2})] \\ \times [\sum_{B_{r}} P (B_{2} ∣ B_{r}) π (B_{r})] \\ = [\sum_{B_{r}, B_{a}} P (B_{1} ∣ B_{r}) P (B_{3} ∣ B_{a}) P (B_{r}, B_{a} ∣ B_{2})] \\ \times [\sum_{B_{r}} P (B_{2} ∣ B_{r}) π (B_{r})] \end{array}

(12)

Note that,

P (B_{r}, B_{a} ∣ B_{2}) = P (B_{r} ∣ B_{a}) (B_{a} ∣ B_{2})

and

\sum_{B_{r}} P (B_{2} ∣ B_{r}) π (B_{r}) = \sum_{B_{r}} [\sum_{B_{a}} P (B_{2} ∣ B_{a}) P (B_{a} ∣ B_{r})] π (B_{r}) .

(13)

As in EquationEq. (4) $p_{j} = \sum_{i = 1}^{4} (1 / 4) p_{i j} (t) = 1 / 4, i, j \in {1234} = {A T C G} .$ (4) , we assign 1/4 to the probability for each nucleotide along the reference sequence B₂ after applying EquationEq. (13) $\sum_{B_{r}} P (B_{2} ∣ B_{r}) π (B_{r}) = \sum_{B_{r}} [\sum_{B_{a}} P (B_{2} ∣ B_{a}) P (B_{a} ∣ B_{r})] π (B_{r}) .$ (13) . Only pairwise alignment between the posterior ancestor sequence and species sequence is used for Monte Carlo integration (EquationEqs. (10) $\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = P (B_{2}, B_{3} ∣ B_{1}) P (B_{1}) \\ = [\sum_{B_{a}} P (B_{2}, B_{3} ∣ B_{a}) P (B_{a} ∣ B_{1})] \\ \times [\sum_{B_{a}} P (B_{1} ∣ B_{a}) π (B_{a})] \\ = [\sum_{B_{a}} P (B_{2} ∣ B_{a}) P (B_{3} ∣ B_{a}) P (B_{a} ∣ B_{1})] \\ [\sum_{B_{a}} P (B_{1} ∣ B_{a}) π (B_{a})] \\ = [I] \times [I I], \end{array}$ (10) , Equation(11) $\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = P (B_{2}, B_{3} ∣ B_{1}) P (B_{1}) \\ = [\sum_{B_{r}, B_{a}} P (B_{2}, B_{3} ∣ B_{r}, B_{a}) P (B_{r}, B_{a} ∣ B_{1})] \\ \times [\sum_{B_{r}} P (B_{1} ∣ B_{r}) π (B_{r})] \\ = [\sum_{B_{r}, B_{a}} P (B_{2}, B_{3} ∣ B_{r}, B_{a}) P (B_{a} ∣ B_{r}) P (B_{r} ∣ B_{1})] \\ \times [\sum_{B_{r}} P (B_{1} ∣ B_{r}) π (B_{r})] \\ = [\sum_{B_{r}, B_{a}} P (B_{2}, B_{3} ∣ B_{a}) P (B_{a} ∣ B_{r}) P (B_{r} ∣ B_{1})] \\ \times [\sum_{B_{r}} P (B_{1} ∣ B_{r}) π (B_{r})] \\ = [\sum_{B_{r}, B_{a}} P (B_{2} ∣ B_{a}) P (B_{3} ∣ B_{a}) P (B_{a} ∣ B_{r}) P (B_{r} ∣ B_{1})] \\ \times [\sum_{B_{r}} P (B_{1} ∣ B_{r}) π (B_{r})] \\ = [I] \times [I I], \end{array}$ (11) , and Equation(12) $\begin{array}{l} P (B_{1}, B_{2}, B_{3}) = P (B_{1}, B_{3} ∣ B_{2}) P (B_{2}) \\ = [\sum_{B_{r}, B_{a}} P (B_{1}, B_{3} ∣ B_{r}, B_{a}) P (B_{r}, B_{a} ∣ B_{2})] \\ \times [\sum_{B_{r}} P (B_{2} ∣ B_{r}) π (B_{r})] \\ = [\sum_{B_{r}, B_{a}} P (B_{1} ∣ B_{r}) P (B_{3} ∣ B_{a}) P (B_{r}, B_{a} ∣ B_{2})] \\ \times [\sum_{B_{r}} P (B_{2} ∣ B_{r}) π (B_{r})] \end{array}$ (12) ). Since the probability of a sequence evolving from an ancestor is obtained by multiplying over all individual nucleotide evolution probabilities along a sequence, a large sequence length (say >100) may result in an overly small probability and lead to numerical overflow. The log-probability (LogPr) for a species evolutionary sequence from the ancestor is the summation of individual nucleotide evolutionary log-probabilities, and the evolutionary probability expectation obtained from Monte Carlo integration (exp(LogPr) mean), can be implemented by using moment generating function with argument one. Normality of these randomly produced LogPrs leads to the simple result of exp (μ + σ²/2) where μ and σ² are the sample mean and variance for these LogPrs.

Simulation and real data study

We first introduce in detail the extended Jukes and Cantor model by Rivas^Citation9 which will be used for our simulation study. The transition probabilities among general states {–ATCG} (“–” is the gap or covalent bond between two nucleotides) until time t are

[\begin{array}{l} - & A & T & C & G \\ - & σ (t) & ξ (t) & ξ (t) & ξ (t) & ξ (t) \\ A & γ (t) & r (t) & s (t) & s (t) & s (t) \\ T & γ (t) & s (t) & r (t) & s (t) & r (t) \\ C & γ (t) & s (t) & s (t) & r (t) & s (t) \\ G & γ (t) & s (t) & s (t) & s (t) & r (t) \end{array}],

(14)

where the {–ATCG} column to the left of the transition probability matrix represents the initial (ancestor) states and the {–ATCG} row on top of the matrix represents the final (species) states. Specifically,

{\begin{array}{l} r (t) = (1 / 4) e^{- β t} + (3 / 4) e^{- (4 α + β) t}; \\ S (t) = (1 / 4) e^{- β t} - (1 / 4) e^{- (4 α + β) t}; \\ γ (t) = 1 - e^{- β t}; \\ ξ (t) = (1 / 4) (1 - q_{0}) e^{- β t}; \\ σ (t) = 1 - (1 - q_{0}) e^{- β t} . \end{array}

(15)

For these generalized transition probabilities, we refer to the notations from the substitution model (EquationEq. (3) $\begin{array}{l} p_{i j} (t) = (1 / 4) + (3 / 4) e^{- 4 α t} for (i = j), and \\ (1 / 4) - (1 / 4) e^{- 4 α t} for (i \neq j) . \end{array}$ (3) ) and denote the element (u, v) in the matrix (EquationEq. (14) $[\begin{array}{l} - & A & T & C & G \\ - & σ (t) & ξ (t) & ξ (t) & ξ (t) & ξ (t) \\ A & γ (t) & r (t) & s (t) & s (t) & s (t) \\ T & γ (t) & s (t) & r (t) & s (t) & r (t) \\ C & γ (t) & s (t) & s (t) & r (t) & s (t) \\ G & γ (t) & s (t) & s (t) & s (t) & r (t) \end{array}],$ (14) ) to be p _−1, _v₋₁(t), where u (row index) and v (column index) = 1, 2, 3, 4, 5. Parameter 0 < q₀ ≤ 1 controls the background (nongap) frequency at time t. Specifically, letting β = 0 leads to the original Jukes and Cantor model (EquationEq. (3) $\begin{array}{l} p_{i j} (t) = (1 / 4) + (3 / 4) e^{- 4 α t} for (i = j), and \\ (1 / 4) - (1 / 4) e^{- 4 α t} for (i \neq j) . \end{array}$ (3) ) and q₀ = 1 excludes nucleotide insertion. Since each pair of neighboring ancestor nucleotides holds a potential insertion site (gap, “–”) with an overall “gap:nongap” ratio of one, we assume a pentanomial distribution for general ancestor nucleotide states with sample space {–[0]A[1]T[2]C[3]G[4]} and normalized probability set (p₀ = 1/2, p₁ = p₂ = p₃ = p₄ = 1/8). This assumption is useful for sampling the posterior ancestor state among {–ATCG} given the reference species state (–, A, T, C or G). If we denote the general species nucleotide state to be J ε {–ATCG}, then the posterior distribution among {–ATCG} for the unknown ancestor state is

P (i ∣ j) = p_{i} p_{i j} (t) / \sum_{k = 0}^{4} p_{k} p_{k j} (t), i = 0, 1, 2, 3, 4.

(16)

Parallel evolution model

We refer to the left panel of .

Simulate the ancestor sequence with length = L₀;
Simulate species “1, 2, 3” sequences from this simulated ancestor sequence;
Apply Monte Carlo integration to randomly produced log(evolution probabilities) for Species 2 and 3 conditional on Species 1 sequence, where the unknown ancestor sequences are sampled using EquationEq. (16) $P (i ∣ j) = p_{i} p_{i j} (t) / \sum_{k = 0}^{4} p_{k} p_{k j} (t), i = 0, 1, 2, 3, 4.$ (16) with corresponding divergence time;
As a numerical verification, we apply Monte Carlo integration to randomly produced log(evolution probabilities) for Species 1 and 3 conditional on Species 2 sequence, where the unknown ancestor sequences are sampled using EquationEq. (16) $P (i ∣ j) = p_{i} p_{i j} (t) / \sum_{k = 0}^{4} p_{k} p_{k j} (t), i = 0, 1, 2, 3, 4.$ (16) with corresponding divergence time;
As another numerical verification, we apply Monte Carlo integration to randomly produced log(evolution probabilities) for Species 1 and 2 conditional on Species 3 sequence, where the unknown ancestor sequences are sampled using EquationEq. (16) $P (i ∣ j) = p_{i} p_{i j} (t) / \sum_{k = 0}^{4} p_{k} p_{k j} (t), i = 0, 1, 2, 3, 4.$ (16) with corresponding divergence time;
We investigate the consistency among different references.
Various divergence time vector (t₁, t₂, t₃) in the left panel of and transition parameter (β and q₀ in EquationEq. (15) ${\begin{array}{l} r (t) = (1 / 4) e^{- β t} + (3 / 4) e^{- (4 α + β) t}; \\ S (t) = (1 / 4) e^{- β t} - (1 / 4) e^{- (4 α + β) t}; \\ γ (t) = 1 - e^{- β t}; \\ ξ (t) = (1 / 4) (1 - q_{0}) e^{- β t}; \\ σ (t) = 1 - (1 - q_{0}) e^{- β t} . \end{array}$ (15) ) configurations are given in , where transition parameter (α in EquationEq. (15) ${\begin{array}{l} r (t) = (1 / 4) e^{- β t} + (3 / 4) e^{- (4 α + β) t}; \\ S (t) = (1 / 4) e^{- β t} - (1 / 4) e^{- (4 α + β) t}; \\ γ (t) = 1 - e^{- β t}; \\ ξ (t) = (1 / 4) (1 - q_{0}) e^{- β t}; \\ σ (t) = 1 - (1 - q_{0}) e^{- β t} . \end{array}$ (15) ) is standardized into one unit.
Table 1 Simulation configurations (parallel [PA] and phylogenic [PH] evolution models, α = 1.0)
Download CSV Display Table

Phylogenic evolution model

We refer to the right panel of .

Simulate root sequence with length = L₀ and the evolved ancestor sequence for Species 2 and 3;
Simulate Species 1 sequence from this simulated root sequence, and simulate the Species 2 and 3 sequences from this simulated ancestor sequence;
Apply Monte Carlo integration to randomly produced log(evolution probabilities) for Species 2 and 3 conditional on Species 1 sequence, where the unknown root and ancestor sequences are sampled using EquationEq. (16) $P (i ∣ j) = p_{i} p_{i j} (t) / \sum_{k = 0}^{4} p_{k} p_{k j} (t), i = 0, 1, 2, 3, 4.$ (16) with corresponding divergence times;
As a numerical verification, we apply Monte Carlo integration to randomly produced log(evolution probabilities) for Species 1 and 3 conditional on Species 2 sequence, where the unknown root and ancestor sequences are sampled using EquationEq. (16) $P (i ∣ j) = p_{i} p_{i j} (t) / \sum_{k = 0}^{4} p_{k} p_{k j} (t), i = 0, 1, 2, 3, 4.$ (16) with corresponding divergence times;
As another numerical verification, we apply Monte Carlo integration to randomly produced log(evolution probabilities) for Species 1 and 2 conditional on Species 3 sequence, where the unknown root and ancestor sequences are sampled using EquationEq. (16) $P (i ∣ j) = p_{i} p_{i j} (t) / \sum_{k = 0}^{4} p_{k} p_{k j} (t), i = 0, 1, 2, 3, 4.$ (16) with corresponding divergence times;
We investigate the consistency among different references.
Various divergence time vector (t₀, t₁, t₂, t₃) in the right panel of and transition parameter (β and q₀ in EquationEq. (15) ${\begin{array}{l} r (t) = (1 / 4) e^{- β t} + (3 / 4) e^{- (4 α + β) t}; \\ S (t) = (1 / 4) e^{- β t} - (1 / 4) e^{- (4 α + β) t}; \\ γ (t) = 1 - e^{- β t}; \\ ξ (t) = (1 / 4) (1 - q_{0}) e^{- β t}; \\ σ (t) = 1 - (1 - q_{0}) e^{- β t} . \end{array}$ (15) ) configurations are given in , where transition parameter (α in EquationEq. (15) ${\begin{array}{l} r (t) = (1 / 4) e^{- β t} + (3 / 4) e^{- (4 α + β) t}; \\ S (t) = (1 / 4) e^{- β t} - (1 / 4) e^{- (4 α + β) t}; \\ γ (t) = 1 - e^{- β t}; \\ ξ (t) = (1 / 4) (1 - q_{0}) e^{- β t}; \\ σ (t) = 1 - (1 - q_{0}) e^{- β t} . \end{array}$ (15) ) is standardized into one unit. We use the same transition parameter (β and q₀ in EquationEq. (15) ${\begin{array}{l} r (t) = (1 / 4) e^{- β t} + (3 / 4) e^{- (4 α + β) t}; \\ S (t) = (1 / 4) e^{- β t} - (1 / 4) e^{- (4 α + β) t}; \\ γ (t) = 1 - e^{- β t}; \\ ξ (t) = (1 / 4) (1 - q_{0}) e^{- β t}; \\ σ (t) = 1 - (1 - q_{0}) e^{- β t} . \end{array}$ (15) ) from the parallel model simulation and make evolution divergence times comparable between the parallel and phylogenic models.

We collect LogPrs from 1000 Monte Carlo simulations. The distribution of these LogPrs are plotted in , , and . A Kolmogorov–Smirnov normality test gives p-value (>0.15) for all LogPr sets, which means that the difference between the produced LogPrs and a normally distributed random variable is not significant. The normality assumption for LogPrs holds and the probability approximation based on log-normal distribution is reasonable. For such an assumption, a heuristic justification without rigorous theoretical proof is as follows: Given each randomly produced ancestor sequence, each nucleotide (event) LogPr on the non-reference sequences acts as an independently and identically distributed random variable, and the summation of these LogPrs follows the central limit theorem for a large sample size (sequence length). By increasing the ancestor or root length from 100 to 500, we can see that the relationship between LogPr and the sequence length is approximately linear. Another observation from and is that different reference species sequences may lead to inconsistent sequence set probabilities due to different evolution durations and/or topological locations within the phylogenic structure. The phylogenic evolution model (the right panel of ) seems to show more inconsistency than the parallel evolution model (the left panel of ) does due to the dual missing sequences (the root and ancestor) instead of ancestor only in the parallel evolution model. A reference sequence which is closer to the root and/or ancestor is preferable since the imputed multiple roots and/or ancestors tend to be more informative due to shorter divergence. The CLUSTAL W multiple alignment^Citation12- induced probability is obtained by moving along the sequence set which holds nucleotides {ATCG} and possible gaps (covalent bonds) and applying Rivas^Citation9 model, where “one gap with two nucleotides” across three matched sequence sites stands for a deletion and “two gaps with one nucleotide” across three matched sequence sites stands for an insertion. For each simulated sequence set, the discrepancy between Monte Carlo integration (MCI) and single multiple alignment (MA) induced probabilities are clearly more significant than that among probabilities estimated from different reference species sequences.

Figure 3 Log(probability) densities from parallel model 1–6.

Notes: Solid line, Species 1; dashed line, Species 2; dotted line, Species 3.

Figure 4 Log(probability) densities from parallel model 7–12.

Notes: Solid line, Species 1; dashed line, Species 2; dotted line, Species 3.

Figure 5 Log(probability) densities from phylogenic model 1–6.

Notes: Solid line, Species 1; dashed line, Species 2; dotted line, Species 3.

Figure 6 Log(probability) densities from phylogenic model 7–12.

Notes: Solid line, Species 1; dashed line, Species 2; dotted line, Species 3.

Table 2 Computation results (parallel evolution models)

Download CSV Display Table

Table 3 Computation results (phylogenic evolution model)

Download CSV Display Table

CREB promoters study

From the ABS database,^Citation13 we extracted the promoter regions of transcription factor CREB for three mammals (human, mouse and rat). We used MEGA 4.1 package^Citation14 to construct the phylogeny tree with corresponding divergence times under uniform transition rate 1 (see ). These are used for sampling the posterior ancestor and root.

Figure 7 Phylogeny tree for orthologous CREB promoters.

Since no current packages give us β and q₀ maximum likelihood estimation for Rivas^Citation9 model (EquationEq. (14) $[\begin{array}{l} - & A & T & C & G \\ - & σ (t) & ξ (t) & ξ (t) & ξ (t) & ξ (t) \\ A & γ (t) & r (t) & s (t) & s (t) & s (t) \\ T & γ (t) & s (t) & r (t) & s (t) & r (t) \\ C & γ (t) & s (t) & s (t) & r (t) & s (t) \\ G & γ (t) & s (t) & s (t) & s (t) & r (t) \end{array}],$ (14) ), we mainly investigate the sequence set probability sensitivity to β and q₀ input by trying different values. We report the means and variances as well as estimated Log(sequence set probabilities) under two parameter settings for β and q₀ under different reference species. We also report the p-values from Kolmogorov–Smirnov normality test (). The normality test results are sensitive to parameter input and reference species selection, which may be due to the fact that conservation levels/transition probabilities are likely to be nonhomogeneous along the sequences. Two LogPr distributions under associated parameter inputs are plotted in . As a verification, we apply multiple alignment to these three promoters and at each site we calculate the nucleotide identity proportion within the window (with size 23) starting from this site at the gene direction (). The conservation levels show that evolutionary transition rates are approximately constant piece-wisely, thus the central limit theorem discussed for the simulated data study may still apply to these LogPrs on each promoter segment with quasi-constant conservation level under certain nucleotide insertion-deletion parameter (β, q₀) values.

Figure 8 Distribution of LogPrs (CREB promoters).

Figure 9 Nucleotide identity proportion along the upstream promoter regions for transcription factor CREB.

Table 4 Computation results for CREB promoter sequence set (phylogenic evolution models)

Download CSV Display Table

Discussion

We proposed and investigated some promising numerical algorithms for accurately estimating the probability of a set of orthologous sequences with equal length under certain assumptions. Our approach was to informatively shuffle the unknown ancestors and/or roots and to find the distributional characteristics of simulated log-probabilities in order to reasonably approximate the true probability. The merit of our approach depends on how well the ancestor and/or root is imputed based on certain pentanomial distribution proportions (p₋, p_A, p_T, p_C, p_G) in EquationEq. (16) $P (i ∣ j) = p_{i} p_{i j} (t) / \sum_{k = 0}^{4} p_{k} p_{k j} (t), i = 0, 1, 2, 3, 4.$ (16) using the evolution model^Citation9 and how reliably the pairwise Needleman-Wunsch alignment is applied to cross-species matching of nucleotides which are supposed to come from the same ancestor entry {–, A, T, C or G}. The former depends on the divergence duration from the ancestor/root to the reference sequence and the latter may depend on the species-specific adjustment of pairwise alignments based on phylogenic information. When this piece of information is not immediately available, the algorithms by Yang,^Citation15 Redelings and Suchard,^Citation11 and MEGA package^Citation14 are useful. Recently, Wong and colleagues^Citation16 demonstrated that various alignments may lead to quite inconsistent inference. Although distance estimation for multiple species from a common ancestor may lack some accuracy using only one sequence set (), we used MEGA package for phylogenic structure information for real sequence set probability estimation. Note that we only use background sequences as examples to demonstrate our algorithms by assuming independent tetranomial distribution among {ATCG} along sequences. For the set of orthologous sequences involving many species (>3), we follow the evolutionary process (described by a phylogenic tree) to sample the internal nodes within the phylogenic tree conditional on one selected reference sequence (a terminal node on the phylogenic tree) and apply Monte Carlo integration to these imputed internal nodes for obtaining LogPrs (we omit the details). As one referee points out, it may be unreliable to directly apply our algorithms to sequences with very irregular lengths, since the insertion–deletion events need to be identified by matching nucleotides across all involved species other than due to artificial sequence truncation. Thus a crude multiple alignment across such sequences may overly produce insertion and/or deletions. A rough solution may involve first applying multiple alignment procedures to these sequences and then segmenting the aligned sequences into subsequences involving different numbers of species followed by segment-wise Monte Carlo integration. However, the internal edge-effects introduced by segmentation deserves further study. Lastly, we highlight that applying the proposed algorithms to real sequences is not so straightforward in view of heterogeneous conservation patterns along the orthologous sequences, which poses as an important future research topic.

Acknowledgments

We thank Terence P Speed for his directions on evolution models when he visited Yale Center for Statistical Genomics and Proteomics (YCSGP) in May 2004. We are also grateful to Stéphane Robin and many anonymous referees for their constructive and insightful comments which greatly improved our work.

Disclosure

The authors report no conflicts of interest in this work.

References

NeedlemanSBWunschCDA general method applicable to the search for similarities in the amino acid sequence of two proteinsJ Mol Biol1970484434535420325
PubMed Web of Science ®Google Scholar
LiuJSNeuwaldAFLawrenceCEMarkovian structures in biological sequence alignmentJ Am Stat Assoc199494115
Google Scholar
KellisMPattersonNEndrizziMBirrenBLanderESSequencing and comparison of yeast species to identify genes and regulatory elementsNature200342324125412748633
PubMed Web of Science ®Google Scholar
MosesAMChiangDYEisenMBPhylogenetic motif detection by expectation-maximization on evolutionary mixturesPac Symp Biocomput200432433514992514
PubMedGoogle Scholar
XieJLiK-CBinaMA Bayesian insertion/deletion algorithm for distant protein motif searching via entropy filteringJ Am Stat Assoc200499466409420
Web of Science ®Google Scholar
WeiZJensenSTGAME: detecting cis-regulatory elements using a genetic algorithmBioinformatics2006221577158416632495
PubMed Web of Science ®Google Scholar
SinhaSHeXMORPH: Probabilistic alignment combined with hidden Markov models of cis-regulatory modulesPLoS Comput Biol2007311e21617997594
PubMed Web of Science ®Google Scholar
JukesTHCantorCREvolution of protein moleculesMunroHNMammalian Protein MetabolismNew YorkAcademic Press196921132
Google Scholar
RivasEEvolutionary models for insertions and deletions in a probabilistic modeling frameworkBMC Bioinformatics 200566315780137
PubMed Web of Science ®Google Scholar
LutzoniFWagnerPReebVZollerSIntegrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homologySyst Biol20004962865112116431
PubMed Web of Science ®Google Scholar
RedelingsBDSuchardMAJoint Bayesian estimation of alignment and phylogenySyst Biol200554340141816012107
PubMed Web of Science ®Google Scholar
HigginsDThompsonJGibsonTThompsonJDHigginsDGGibsonTJCLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choiceNucleic Acids Res199422467346807984417
PubMed Web of Science ®Google Scholar
BlancoEFarréDAlbàMMesseguerXGuigòRABS: a database of annotated regulatory binding sites from orthologous promotersNucleic Acids Res200634D636716381947
PubMed Web of Science ®Google Scholar
TamuraKDudleyJNeiMKumarSMEGA4: Molecular evolutionary genetics analysisMol Biol Evol2007241596159917488738
PubMed Web of Science ®Google Scholar
YangZMaximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sitesMol Biol Evol199310139614018277861
PubMed Web of Science ®Google Scholar
WongKMSuchardMAHuelsenbeckJPAlignment uncertainty and genomic analysisScience200831947347618218900
PubMed Web of Science ®Google Scholar

On calculating the probability of a set of orthologous sequences

Abstract

Introduction