References
- Berger , JO and Berry , DA . 1988 . Statistical Analysis and the Illusion of Objectivity . American Scientist , 76 : 159 – 165 .
- Berrar , D , Bradbury , I and Dubitzky , W . 2006 . Avoiding Model Selection Bias in Small-sample Genomic Data Sets . Bioinformatics , 22 : 1245 – 1250 .
- Bouckaert , RR and Frank , E . 2004 . Evaluating the ReplicabilIty of Significance Tests for Comparing Learning Algorithms . Advances in Knowledge Discovery and Data Mining , 3056 : 3 – 12 .
- Breiman , L . 2001 . Random Forests . Machine Learning , 45 : 5 – 32 .
- Cawley , GC and Talbot , NLC . 2010 . On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation . Journal of Machine Learning Research , 11 : 2079 – 2107 .
- Cummings , G . 2012 . Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis , New York/London : Routledge, Taylor & Francis Group .
- Demšar , J . 2006 . Statistical Comparisons of Classifiers Over Multiple Data Sets . Journal of Machine Learning Research , 7 : 1 – 30 .
- Demšar , J . 2008 . On the Appropriateness of Statistical Tests in Machine Learning . in Proceedings of ICML 2008 Workshop on Evaluation Methods for Machine Learning II , Helsinki, Finland, 5–9 July 2008
- Denis, D. (2003), ‘Alternatives to Null Hypothesis Significance Testing’, Theory & Science, 4(1). Available online at http://theoryandscience.icaap.org/content/vol4.1/02_denis.html. Accessed 19 April 2012
- Dietterich , TG . 1998 . Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms . Neural Computation , 10 : 31 – 36 .
- Dixon , P . 1998 . Why Scientists Value p Values . Psychonomic Bulletin & Review , 5 : 390 – 396 .
- Drummond , C . (2006), ‘Machine Learning as an Experimental Science, Revisited’, in Proceedings of the 21st National Conference on Artificial Intelligence: Workshop on Evaluation Methods for Machine Learning, AAAI Press Technical Report WS-06-06, pp. 1–5
- Drummond , C . 2008 . Finding a Balance Between Anarchy and Orthodoxy . in Proceedings of ICML 2008 Workshop on Evaluation Methods for Machine Learning II , Helsinki, Finland, 5–9 July 2008
- Drummond , C and Japkowicz , N . 2010 . Warning: Statistical Benchmarking is Addictive. Kicking the Habit in Machine Learning . Journal of Experimental and Theoretical Artificial Intelligence , 2 : 67 – 80 .
- Dugas , C and Gadoury , D . 2010 . Pointwise Exact Bootstrap Distributions of ROC Curves . Machine Learning , 78 : 103 – 136 .
- Fraley , RC and Marks , MJ . 2007 . “ The Null Hypothesis Significance Testing Debate and its Implications for Personality Research ” . In Handbook of Research Methods in Personality Psychology , Edited by: Robins , RW , Fraley , RC and Krueger , RF . 149 – 169 . New York : Guilford .
- Frank , A . and Asuncion, A. (2010), ‘UcI Machine Learning Repository’, URL http://archive.ics.uci.edu/ml
- Garcia , S and Herrera , F . 2008 . An Extension on Statistical Comparisons of Classifiers Over Multiple Data Sets for All Pairwise Comparisons . Journal of Machine Learning Research , 9 : 2677 – 2694 .
- Goodman , S . 1993 . P Values, Hypothesis Tests, and Likelihood: Implications for Epidemiology of a Neglected Historical Debate . American Journal of Epidemiology , 137 : 485 – 496 .
- Goodman , S . 1999 . Toward Evidence-based Medical Statistics. 1: The p Value Fallacy . Annals of Internal Medicine , 130 : 995 – 1004 .
- Goodman , S . 2008 . A Dirty Dozen: Twelve p-Value Misconceptions . Seminars in Hematology , 45 : 135 – 140 .
- Hand , D . 2006 . Classifier Technology and the Illusion of Progres . Statistical Science , 21 : 1 – 14 .
- Hastie , T , Tibshirani , R and Friedman , J . 2008 . The Elements of Statistical Learning, , 2nd , New York, Berlin, Heidelberg : Springer .
- Holland , B . 1991 . On the Application of Three Modified Bonferroni Procedures to Pairwise Multiple Comparisons in Balanced Repeated Measures Designs . Computational Statistics Quarterly , 6 : 219 – 231 .
- Holm , S . 1979 . A Simple Sequentially Rejective Multiple Test Procedure . Scandinavian Journal of Statistics , 6 : 65 – 70 .
- International Committee of Medical Journal Editors (1997), ‘Uniform Requirements for Manuscripts Submitted to Biomedical Journals’, New England Journal of Medicine, 336, 309–315
- Johnson , DH . 1999 . The Insignificance of Statistical Significance Testing . Journal of Wildlife Management , 63 : 763 – 772 .
- Kibler , D . and Langley, P. (1988), ‘Machine Learning as an Experimental Science’, in Proceedings of the 7th International Conference on Machine Learning, pp. 1207–1211
- Langley , P . 2011 . Machine Learning As an Experimental Science . Machine Learning , 82 : 275 – 279 .
- Leslie , C . 2008 . “ Exhaustive Conditional Inference: Improving the Evidential Value of a Statistical Test by Identifying the Most Relevant p-value and Error Probabilities ” . In PhD Thesis , Australia : University of Melbourne .
- Levin , JR and Robinson , DH . 1999 . Further Reflections on Hypothesis Testing and Editorial Policy for Primary Research Journals . Educational Psychology Review , 11 ( 2 ) : 143 – 155 .
- Manly , KF , Nettleton , D and Hwang , JT . 2004 . Genomics, Prior Probability, and Statistical Tests of Multiple Hypotheses . Genome Research , 14 : 997 – 1001 .
- May , WL and Johnson , WD . 1997 . Confidence Intervals for Differences in Correlated Binary Proportions . Statistics in Medicine , 16 : 2127 – 2136 .
- McNemar , Q . 1947 . Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages . Psychometrika , 12 : 153 – 157 .
- Mulaik , SA , Raju , NS and Harshman , RA . 2007 . “ There Is a Time and a Place for Significance Testing ” . In What If There Were No Significance Tests? , Edited by: Harlow , LL , Mulaik , SA and Steiger , JH . 65 – 115 . New Jersey (USA) : Lawrence Erlbaum Associates .
- Nadeau , C and Bengio , Y . 2003 . Inference for the Generalization Error . Machine Learning , 52 : 239 – 281 .
- Nix , TW and Barnette , JJ . 1998 . The Data Analysis Dilemma: Ban or Abandon. A Review of Null Hypothesis Significance Testing . Research in the Schools , 5 : 3 – 14 .
- Ojala , M and Garriga , GC . 2010 . Permutation Tests for Studying Classifier Performance . Journal of Machine Learning Research , 11 : 1833 – 1863 .
- Poole , C . 2001 . Low p-values or Narrow Confidence Intervals: Which Are More Durable? . Epidemiology , 12 : 291 – 294 .
- Quesenberry , CP and Hurst , DC . 1964 . Large Sample Simultaneous Confidence Intervals for Multinomial Proportions . Technometrics , 6 : 191 – 195 .
- Team , R Development Core . (2009), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing
- Robinson , GK . 1978 . On the Necessity of Bayesian Inference and the Construction of Measures of Nearness to Bayesian Form . Biometrika , 65 : 49 – 52 .
- Rothman , J . 1978 . A Show of Confidence . New England Journal of Medicine , 299 : 1362 – 1363 .
- Schmidt , FL . 1996 . Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers . Psychological Methods , 1 : 115 – 129 .
- Sheskin , DJ . 2007 . Handbook of Parametric and Nonparametric Statistical Procedures , New York : Chapman and Hall, CRC .
- Sotiriou , C , Wirapati , P , Loi , S , Harris , A , Fox , S , Smeds , J , Nordgren , H , Farmer , P , Praz , V , HaibeKains , B , Desmedt , C , Larsimont , D , Cardoso , F , Peterse , H , Nuyten , D , Buyse , M , van de Vijver , MJ , Bergh , J , Piccart , M and Delorenzi , M . 2006 . Gene Expression Profiling in Breast Cancer: UnderstandIng the Molecular Basis of HistoLogic Grade to Improve Prognosis . Journal of the National Cancer Institute , 98 : 262 – 272 .