Bayesian hidden Markov models in DNA sequence segmentation using R: the case of Simian Vacuolating virus (SV40)

James A. Totterdell, Biostatistician, Waverton, NSW, Australia

Darfiana Nur, School of Computer Science, Engineering and Mathematics, Flinders University, Tonsley, SA, Australia. Correspondence: [email protected]

Kerrie L. Mengersen, School of Mathematical Sciences, Queensland University of Technology and The Australian Research Council (ARC) Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), Brisbane, QLD, Australia

ABSTRACT

Segmentation models aim to partition compositionally heterogeneous domains into homogeneous segments that may reflect biological function. Because the segments are latent, a natural approach to segmentation that has recently gained favour uses Bayesian hidden Markov models (HMMs). Over the last few decades, the free R programming language has also become a dominant tool for computational statistics, visualization and data science. This paper therefore aims to fully exploit R to fit a Bayesian HMM for DNA segmentation. The joint posterior distribution of the model parameters is derived, followed by the algorithms that can be used for estimation. Functions implementing these algorithms (Gibbs Sampling, Data Augmentation and Label Switching) are then written entirely in R. The methodology is assessed through extensive simulation studies and then applied to analyse Simian Vacuolating virus (SV40). It is concluded that: (1) the algorithms and R functions can correctly estimate the sequence segmentation if the HMM structure is assumed; (2) the performance of the model improves with sequence length; (3) R is reasonably fast for short to medium sequence lengths and numbers of segments; and (4) the segmentation of SV40 appears to correspond with the two major transcripts, early and late, that regulate the expression of SV40 genes.

Acknowledgements

The authors thank one anonymous referee for the valuable suggestions and comments. This paper arises from the BMath (Honours) thesis of J. Totterdell at the University of Newcastle, Australia.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

The research of D. Nur was supported by the Establishment Grant funded by the Faculty of Science and Engineering and the School of Computer Science, Engineering and Mathematics at Flinders University, Australia.
