Journal of Experimental & Theoretical Artificial Intelligence Volume 25, 2013 - Issue 2

Original Articles

Significance tests or confidence intervals: which are preferable for the comparison of classifiers?

Daniel Berrar (corresponding author), Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama 226-8502, Japan

Jose A. Lozano, Department of Computer Science and Artificial Intelligence, Intelligent Systems Group, University of the Basque Country UPV/EHU, Manuel de Lardizabal, 1, 20018 Donostia–San Sebastián, Gipuzkoa, Spain

Pages 189-206 | Received 22 Jul 2011, Accepted 26 Feb 2012, Published online: 26 Apr 2012

https://doi.org/10.1080/0952813X.2012.680252

References

Berger, JO and Berry, DA. 1988. Statistical Analysis and the Illusion of Objectivity. American Scientist, 76: 159–165.
Berrar, D, Bradbury, I and Dubitzky, W. 2006. Avoiding Model Selection Bias in Small-sample Genomic Data Sets. Bioinformatics, 22: 1245–1250.
Bouckaert, RR and Frank, E. 2004. Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms. Advances in Knowledge Discovery and Data Mining, 3056: 3–12.
Breiman, L. 2001. Random Forests. Machine Learning, 45: 5–32.
Cawley, GC and Talbot, NLC. 2010. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11: 2079–2107.
Cumming, G. 2012. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis, New York/London: Routledge, Taylor & Francis Group.
Demšar, J. 2006. Statistical Comparisons of Classifiers Over Multiple Data Sets. Journal of Machine Learning Research, 7: 1–30.
Demšar, J. 2008. On the Appropriateness of Statistical Tests in Machine Learning. In Proceedings of ICML 2008 Workshop on Evaluation Methods for Machine Learning II, Helsinki, Finland, 5–9 July 2008.
Denis, D. 2003. Alternatives to Null Hypothesis Significance Testing. Theory & Science, 4(1). Available online at http://theoryandscience.icaap.org/content/vol4.1/02_denis.html. Accessed 19 April 2012.
Dietterich, TG. 1998. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10: 1895–1923.
Dixon, P. 1998. Why Scientists Value p Values. Psychonomic Bulletin & Review, 5: 390–396.
Drummond, C. 2006. Machine Learning as an Experimental Science, Revisited. In Proceedings of the 21st National Conference on Artificial Intelligence: Workshop on Evaluation Methods for Machine Learning, AAAI Press Technical Report WS-06-06, pp. 1–5.
Drummond, C. 2008. Finding a Balance Between Anarchy and Orthodoxy. In Proceedings of ICML 2008 Workshop on Evaluation Methods for Machine Learning II, Helsinki, Finland, 5–9 July 2008.
Drummond, C and Japkowicz, N. 2010. Warning: Statistical Benchmarking is Addictive. Kicking the Habit in Machine Learning. Journal of Experimental and Theoretical Artificial Intelligence, 22: 67–80.
Dugas, C and Gadoury, D. 2010. Pointwise Exact Bootstrap Distributions of ROC Curves. Machine Learning, 78: 103–136.
Fraley, RC and Marks, MJ. 2007. “The Null Hypothesis Significance Testing Debate and its Implications for Personality Research”. In Handbook of Research Methods in Personality Psychology, Edited by: Robins, RW, Fraley, RC and Krueger, RF. 149–169. New York: Guilford.
Frank, A and Asuncion, A. 2010. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml
Garcia, S and Herrera, F. 2008. An Extension on Statistical Comparisons of Classifiers Over Multiple Data Sets for All Pairwise Comparisons. Journal of Machine Learning Research, 9: 2677–2694.
Goodman, S. 1993. P Values, Hypothesis Tests, and Likelihood: Implications for Epidemiology of a Neglected Historical Debate. American Journal of Epidemiology, 137: 485–496.
Goodman, S. 1999. Toward Evidence-based Medical Statistics. 1: The p Value Fallacy. Annals of Internal Medicine, 130: 995–1004.
Goodman, S. 2008. A Dirty Dozen: Twelve p-Value Misconceptions. Seminars in Hematology, 45: 135–140.
Hand, D. 2006. Classifier Technology and the Illusion of Progress. Statistical Science, 21: 1–14.
Hastie, T, Tibshirani, R and Friedman, J. 2008. The Elements of Statistical Learning, 2nd ed., New York, Berlin, Heidelberg: Springer.
Holland, B. 1991. On the Application of Three Modified Bonferroni Procedures to Pairwise Multiple Comparisons in Balanced Repeated Measures Designs. Computational Statistics Quarterly, 6: 219–231.
Holm, S. 1979. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6: 65–70.
International Committee of Medical Journal Editors. 1997. Uniform Requirements for Manuscripts Submitted to Biomedical Journals. New England Journal of Medicine, 336: 309–315.
Johnson, DH. 1999. The Insignificance of Statistical Significance Testing. Journal of Wildlife Management, 63: 763–772.
Kibler, D and Langley, P. 1988. Machine Learning as an Experimental Science. In Proceedings of the 7th International Conference on Machine Learning, pp. 1207–1211.
Langley, P. 2011. Machine Learning As an Experimental Science. Machine Learning, 82: 275–279.
Leslie, C. 2008. Exhaustive Conditional Inference: Improving the Evidential Value of a Statistical Test by Identifying the Most Relevant p-value and Error Probabilities. PhD thesis, University of Melbourne, Australia.
Levin, JR and Robinson, DH. 1999. Further Reflections on Hypothesis Testing and Editorial Policy for Primary Research Journals. Educational Psychology Review, 11(2): 143–155.
Manly, KF, Nettleton, D and Hwang, JT. 2004. Genomics, Prior Probability, and Statistical Tests of Multiple Hypotheses. Genome Research, 14: 997–1001.
May, WL and Johnson, WD. 1997. Confidence Intervals for Differences in Correlated Binary Proportions. Statistics in Medicine, 16: 2127–2136.
McNemar, Q. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 12: 153–157.
Mulaik, SA, Raju, NS and Harshman, RA. 2007. “There Is a Time and a Place for Significance Testing”. In What If There Were No Significance Tests?, Edited by: Harlow, LL, Mulaik, SA and Steiger, JH. 65–115. New Jersey (USA): Lawrence Erlbaum Associates.
Nadeau, C and Bengio, Y. 2003. Inference for the Generalization Error. Machine Learning, 52: 239–281.
Nix, TW and Barnette, JJ. 1998. The Data Analysis Dilemma: Ban or Abandon. A Review of Null Hypothesis Significance Testing. Research in the Schools, 5: 3–14.
Ojala, M and Garriga, GC. 2010. Permutation Tests for Studying Classifier Performance. Journal of Machine Learning Research, 11: 1833–1863.
Poole, C. 2001. Low p-values or Narrow Confidence Intervals: Which Are More Durable? Epidemiology, 12: 291–294.
Quesenberry, CP and Hurst, DC. 1964. Large Sample Simultaneous Confidence Intervals for Multinomial Proportions. Technometrics, 6: 191–195.
R Development Core Team. 2009. R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
Robinson, GK. 1978. On the Necessity of Bayesian Inference and the Construction of Measures of Nearness to Bayesian Form. Biometrika, 65: 49–52.
Rothman, J. 1978. A Show of Confidence. New England Journal of Medicine, 299: 1362–1363.
Schmidt, FL. 1996. Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers. Psychological Methods, 1: 115–129.
Sheskin, DJ. 2007. Handbook of Parametric and Nonparametric Statistical Procedures, New York: Chapman and Hall, CRC.
Sotiriou, C, Wirapati, P, Loi, S, Harris, A, Fox, S, Smeds, J, Nordgren, H, Farmer, P, Praz, V, Haibe-Kains, B, Desmedt, C, Larsimont, D, Cardoso, F, Peterse, H, Nuyten, D, Buyse, M, van de Vijver, MJ, Bergh, J, Piccart, M and Delorenzi, M. 2006. Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade to Improve Prognosis. Journal of the National Cancer Institute, 98: 262–272.