Abstract
Phylogenetic trees are critical in human genome research for investigating human evolution and identifying disease-associated genetic markers. New high-throughput genome sequencing technologies create an urgent need for statistical methods that can construct phylogenetic trees from long genome sequences quickly while accounting for various biological complexities. An ancestral mixture model has been proposed to this end [Chen SC, Lindsay BG. Building mixture trees from binary sequence data. Biometrika. 2006;93(4):843–860. doi: 10.1093/biomet/93.4.843]; it allows for genetic mutations over generations but omits another essential evolutionary factor, genetic recombination. In this paper, we therefore develop a novel genetic recombination model for tree construction and propose using Markov chain composite likelihood (MCCL) to make model estimation computationally feasible. To further reduce computational complexity, we construct a hierarchical estimator that estimates the unknown ancestral distributions through MCCL. Simulation studies and a real data example show that the proposed methods perform well and are computationally fast, and so have the potential to be applied to long-sequence genome data.
Acknowledgments
The authors express their sincere gratitude for the reviewer's insightful comments and valuable suggestions, which have significantly improved this manuscript. During the preparation of this work, Dr. Bruce G. Lindsay passed away due to an illness. We dearly miss this brilliant statistician, wise and excellent advisor, and warm friend. The first author is particularly grateful for Professor Lindsay's invaluable mentorship, support, and encouragement of her early-stage research, which she will treasure forever.
Disclosure statement
No potential conflict of interest was reported by the author(s).