Article; Bioinformatics

DeepFinder: An integration of feature-based and deep learning approach for DNA motif discovery

Pages 759-768 | Received 17 Feb 2017, Accepted 03 Feb 2018, Published online: 10 Feb 2018
