Content Audit for p-value Principles in Introductory Statistics


ABSTRACT

Longstanding concerns with the role and interpretation of p-values in statistical practice prompted the American Statistical Association (ASA) to issue a statement on p-values. The statement spurred a flurry of responses and discussions among statisticians, with many wondering what steps are necessary to expand the adoption of its principles. Introductory statistics classrooms are key locations to introduce and emphasize the nuance surrounding p-values, in part because they ingrain appropriate analysis choices at the earliest stages of statistics education, and in part because they reach the broadest group of students. We propose a framework for statistics departments to conduct a content audit for p-value principles in their introductory curriculum. We then discuss the process and results of applying this course audit framework within our own statistics department. We also recommend meeting with client departments as a complement to the course audit. Discussions about the analyses and practices common to particular fields can help evaluate whether our service courses are meeting the needs of client departments and identify what is needed in our introductory courses to combat the misunderstanding and future misuse of p-values.

1 Introduction

The publication of this journal’s special issue reflects a growing consensus: p-values are often misused, and that misuse often leads to bad science. Many argue that the central challenges are to understand the logic of testing scientific hypotheses and to move away from mechanical rules like p < 0.05 as a substitute for contextual reasoning. The logic of testing, we argue, is best taught in a first statistics course. However, research in statistics education makes clear that this logic is far harder to teach and to learn than the simple rule p < 0.05. This poses a particular challenge for those who teach introductory statistics courses. In this article, we propose a process for auditing the coverage of p-value principles in an introductory statistics course.

The misuse of p-values is frequent, well documented, and a potential driver of bad science, most notably the “reproducibility crisis.” Ioannidis sounded an alarm with his paper “Why most published research findings are false” (Ioannidis 2005). A decade later, the Open Science Collaboration (2015) repeated 100 experiments taken from the psychology literature; only 39 produced results that replicated the original findings. In the health sciences, Greenland et al. (2016) offered readers a catalogue of misinterpretations of p-values. As Berry (2016) noted, “(o)ur collective credibility in the science community is at risk.” These and other articles led the American Statistical Association (ASA) to issue a statement on the proper use of p-values (Wasserstein and Lazar 2016). The ASA statement spurred a flurry of responses and discussions by statisticians, along with the 2017 ASA Symposium on Statistical Inference. We now look toward the next steps necessary to expand the adoption of these principles.

Although there is a growing consensus about the nature of the problem, there is little consensus on simple remedies like banning p-values altogether or reducing the threshold for “significance” to p < 0.005. Most agree that the heart of the problem is reliance on mechanical rules like p < 0.05. Such rules cannot substitute for the logic of hypothesis testing applied in the scientific context. Unfortunately, decades of research have shown that this logic is not easy to learn (Falk and Greenbaum 1995; Williams 1999; Batanero 2000; Garfield and Ben-Zvi 2008; Harradine, Batanero, and Rossman 2011). delMas et al. (2007) gave a multiple-choice test to students who had completed an introductory statistics course and found that only 54.5% could identify the correct interpretation of p-values; only 58.6% could identify incorrect interpretations. Rossman (2008) cited the work of Nickerson (2004) in cognitive psychology to argue that one explanation “surely rests in all of the research that has shown how difficult probabilistic reasoning is for people.”

Stangl (2016) argued that we have a responsibility in statistics education to preempt and end the perpetual misuse of p-values. Cobb (2016) responded to the ASA statement by saying, “(w)hat ASA has done here should spur a reshaping of the way we teach—both p-values in particular and statistics generally.” There are many reasons to focus on the introductory course in statistics. For many students it is their first encounter with the logic of statistical inference. For most of those students it is also their last formal encounter with that logic in an academic setting. Moreover, courses that introduce statistical thinking and methods reach a very large percentage of the students who will become practicing scientists and evidence-based decision makers.

The ASA-sponsored report, Guidelines for Assessment and Instruction in Statistics Education (GAISE), outlines goals and methods for teaching introductory statistics (Carver et al. 2016). The GAISE report sets as a goal “(s)tudents should demonstrate an understanding of, and be able to use, basic ideas of statistical inference, both hypothesis tests and interval estimation, in a variety of settings.” Millar (2016) argued that “(s)tudents of other disciplines will be in our service courses, and while we should not advocate for hypothesis tests as the monolithic statistical inference method, they do need to know what it is and what its shortcomings are because they will encounter it.” Goodman (2016) also pointed out the responsibility our courses have to future scientists, saying “(t)he fact that statisticians do not all accept at face value what most scientists are routinely taught as uncontroversial truisms will be a shock to many. But if we are to move science forward, we must speak to scientists.” Berry (2016) was more direct: “We must communicate better even if we have to scream from the rooftops.” But as statisticians, before we scream, we should gather data.

In the remainder of this article, we describe and illustrate two tools: a rubric for assessing how well a course addresses the ASA’s principles for the sound use of p-values, and focus group discussions between teachers of statistics and their colleagues in the sciences. We hope these will help frame systematic conversations within and across departmental lines. In what follows, Section 2 describes the rubric and focus group questions, Section 3 illustrates an example of conducting the course audit with these methods, and Section 4 concludes with a discussion.

2 Methods

For a department to perform a comprehensive audit of p-value concepts in its introductory statistics courses, we propose a framework that elicits both intra-departmental and inter-departmental feedback on the current curriculum. The combination of introspection and external suggestions can then drive targeted curricular adaptation to better cover the challenging topics related to p-values. The key idea for this assessment is that each of the six principles articulated in the ASA statement provides a target that can be directly evaluated with a common rubric, as we discuss next.

2.1 Intra-Departmental Evaluation

For a department with several instructors involved in teaching introductory statistics courses, it may be helpful to use a common rubric as a unified starting point for identifying strengths and weaknesses in teaching about p-values. We propose the rubric found in Table 1. The rubric is structured to be applied to each of the six principles from the ASA statement:

Table 1 Curricular component rubric.

  1. p-values can indicate how incompatible the data are with a specified statistical model.

  2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

  4. Proper inference requires full reporting and transparency.

  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

The rubric evaluates how each principle is formally introduced by the instructor, reinforced with classroom activities, assessed for comprehension, and supported with appropriate supplementary learning resources. These broad categories of course components are used because they are likely to be present, in some form, in most introductory courses.

Rubric instructions for evaluating the ASA p-value principles:

Repeat the following steps for each principle i = 1,…,6

  1. Reread the ith ASA p-value principle.

  2. Reflect on how completely, correctly, and consistently each of the four broad curricular components (instruction, class activity, assessment, and support materials) reflects the ith ASA p-value principle.

  3. Score the corresponding item based on the statement in Table 1 that you feel most accurately describes how substantially the curricular component reflects the ith ASA p-value principle. Enter this in the curriculum audit report card in Table 2.

  4. Sum the scores from the four curricular components. The sum will range from 1 to 4. Treat this sum as a GPA value and assign a letter grade (1.0 = D, 1.3 = D+, 1.7 = C-, 2.0 = C, 2.3 = C+, 2.7 = B-, 3.0 = B, 3.3 = B+, 3.7 = A-, 4.0 = A). A minimal scoring sketch in code follows this list.
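To make the grading arithmetic concrete, the short Python sketch below scores a single principle. It is only an illustration of the steps above: the quarter-point component scale and the nearest-anchor rounding are our assumptions (the actual scores come from the rubric statements in Table 1), and the function name is hypothetical.

    # Illustrative helper for steps 3 and 4 of the rubric instructions.
    # Assumes each of the four curricular components is scored on a
    # quarter-point scale (0.25, 0.50, 0.75, 1.00), so sums run from 1 to 4.
    GPA_GRADES = [
        (1.0, "D"), (1.3, "D+"), (1.7, "C-"), (2.0, "C"), (2.3, "C+"),
        (2.7, "B-"), (3.0, "B"), (3.3, "B+"), (3.7, "A-"), (4.0, "A"),
    ]

    def principle_grade(instruction, activity, assessment, support):
        """Sum the four component scores for one ASA principle and
        return the sum with the closest letter grade on the GPA scale."""
        total = instruction + activity + assessment + support
        _, grade = min(GPA_GRADES, key=lambda anchor: abs(anchor[0] - total))
        return total, grade

    # Example: strong instruction, weaker activities and support materials.
    print(principle_grade(1.00, 0.50, 0.75, 0.50))  # -> (2.75, 'B-')

A department preferring the dichotomous or weighted variants discussed below would only need to change how the total is computed.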

Table 2 Curriculum audit report card.

With such a general grading guide, the scores and letter grades are clearly subjective in nature, but they may provide an informative shorthand familiar to most educators. It is important to highlight that the rubric is primarily designed as a support tool to help frame discussion and target areas for improvement in an internally administered content audit. Creating an overall score is secondary, as the rubric is not a nationally normed instrument for comparing the quality of curricula across universities. The rubric can be completed collectively or by individual instructors, and then used to drive a discussion on steps to improve the curriculum’s coverage of the ASA principles. Alternative rubrics could be devised by moving to simple dichotomous responses for the inclusion or exclusion of components covering particular ASA principles, by including weights for components that a department finds more essential, or by breaking from the letter-grade theme altogether. In the end, we employed a relatively interpretable scale to provide a basis for reflection and discussion. If several department members identify a deficiency in teaching one of the principles from the ASA statement, a collective curricular remediation can be planned. In the case that a principle is covered well by one instructor but not by another, class materials and teaching advice can be shared to help patch the gap. The results of an audit conducted by a subgroup of a department might provide the basis for discussion at a departmental retreat or meeting.

We conducted the content audit for p-value principles within our department at Miami University, a mid-sized public Midwestern university. Multiple introductory statistics courses are offered through our department, each geared toward a different subpopulation of students. We applied the rubric to our algebra-based introductory statistics service course for undergraduate students, STA 261, which serves an assortment of majors. The course begins with examples that challenge students to begin inferential thinking by using simulation-based methods to evaluate the likelihood of observed data under assumed conditions. The curriculum then proceeds through a unit on probability and sampling distributions before introducing p-values in a probability-based framework.

This class is taught to 600 students each semester using a hybrid model with online introduction of concepts, just-in-time teaching of problematic concepts in a large lecture meeting, and a smaller lab section where statistical principles are explored. In a typical semester, two large lectures are taught by a continuing lecturer and four other large lectures are taught by term-limited faculty, namely Instructors or Visiting Assistant Professors. Graduate students facilitate lab sections. Course materials and labs are all centrally constructed by the course coordinator. In our case, the rubric was completed collectively by members of the author group and the results can be found in Section 3.1.

2.2 Inter-Departmental Evaluation

Next, we recommend running a small focus group discussion with analytically savvy members of client departments to gain an interdisciplinary perspective on the statistics curriculum at your institution. A good starting place is to identify departments with large numbers of undergraduate majors who are required to take your introductory service course in statistics. These will often be psychology, political science, and biology departments, but this will vary from campus to campus. The goals are to advocate for good analytical practice in the scientific community, in this case pertaining to p-values, and to ask for candid feedback on how the statistics service classes meet the needs of the respective fields, based on the participants’ general observations. Our focus group included faculty members from psychology, biology, geology, and kinesiology.

Along with an invitation to the meeting, we suggest sending a brief description with a web link to the ASA p-value statement and a short list of questions you plan to discuss. For example, when we conducted our focus group, we used the following questions as discussion prompts:

  • Teaching in our introductory statistics service courses:

    • Do your students show understanding of p-values and hypothesis testing?

    • What methods should our introductory statistics course include?

  • Teaching in our advanced statistics service courses:

    • What do our advanced service courses do well to prepare your students?

    • Do your advanced students show understanding of p-values and hypothesis testing?

    • What is missing and what would you like to see us address in more detail?

  • Why is a fundamental understanding of statistics important in your field?

    • What attracted and engaged you personally with statistics?

3 Results

In Fall 2017, we conducted a content audit for the coverage of p-values in our algebra-based introductory statistics course and held a focus group meeting with representatives from client departments. The results of the content audit and focus group are summarized in Sections 3.1 and 3.2, respectively.

3.1 Application of Course Content Audit Rubric

Table 3 contains the completed report card from the rubric, which members of the author team filled out collectively to evaluate the curriculum of the introductory course discussed in Section 2.1. The audit of our algebra-based introductory course found that we avoided improper probabilistic interpretations and clearly articulated the difference between statistical and practical significance, as highlighted in principles two and five, respectively. However, the results of the audit suggested that we had ample room to improve our coverage of some principles. As a result, we developed a set of action items to address these areas of curricular weakness. A natural follow-up to the assessment is to document what features were highlighted in this evaluation as justification for each assigned grade; Table 4 provides this reflection.

Table 3 Completed curriculum audit report card.

Table 4 Results of our content audit with a short explanation specific to each principle.

The most urgent action item was to replace the rote procedural approach of running down a checklist of model assumptions to accompany each hypothesis test, and instead to frame hypothesis tests more comprehensively with respect to clearly specified statistical models. For example, we can discuss a hypothesis test that a population proportion of success equals 0.5, where an exact binomial test follows from a simple and familiar binomial model (a minimal sketch appears after this paragraph). Here, we can note that a small p-value could reflect either that the true proportion differs from 0.5 or that the observations did not come from a binomial experiment. Another action item from the audit is to include a case study that deconstructs a real-world analysis involving hypothesis tests, where we can highlight study design, the dangers of “p-hacking,” effect sizes, the integration of additional evidence, and the resulting policy decisions. The last action item is to expand our discussions of reproducibility beyond documenting data handling to include all stages of a study, from design through analysis. The Reproducibility Project led by the Center for Open Science can provide a good launching point for this discussion (Weir 2015).
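The sketch below, in Python, shows this classroom example with hypothetical counts; the counts and the two-sided alternative are our choices for illustration, not material from the course.

    # Exact binomial test of H0: p = 0.5 under a fully specified model.
    # A small p-value can signal either that the true proportion differs
    # from 0.5 or that the binomial assumptions (fixed n, independent
    # trials, constant success probability) do not hold.
    from scipy.stats import binomtest

    successes, trials = 14, 20  # hypothetical counts, not real data
    result = binomtest(successes, trials, p=0.5, alternative="two-sided")
    print(f"Exact two-sided p-value: {result.pvalue:.3f}")  # about 0.115

Because the model is stated explicitly, the discussion of what a small p-value does and does not mean follows naturally from the example rather than from a memorized checklist.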

We shared the results of this introductory course audit with our colleagues at a recent departmental retreat. The discussion encouraged changes to content and exercises to better address the ASA p-value principles throughout our curriculum. We also asked our colleagues to consider how we can similarly focus the curriculum of our upper-level classes to reinforce concepts that are challenging to impart to students in an introductory course.

3.2 Summary of Focus Group Discussion

Looking externally for feedback, our focus group comprised five colleagues from the fields of biology, kinesiology, psychology, and environmental science. During the ninety-minute session with these representatives of our client departments, we asked the questions listed in Section 2.2. The questions were posed as prompts for discussion periods in which a member of the statistics faculty acted primarily as a facilitator: taking notes, asking for additional details, and prompting for points of clarification. The discussion yielded fruitful conversation; a detailed summary is found in Table 5.

Table 5 Highlights from focus group meeting with colleagues from client departments.

The focus group provided strong insights for us to consider when we engage students from many scientific disciplines in our service courses. We are not proposing that we completely restructure our course based on this input, only that we acknowledge what client departments view as important when evaluating service course curricula. A major theme of the conversation was that hypothesis testing and p-values are used widely in the participants’ fields, so students need a better understanding of the ASA principles. The focus group would also like to see more thorough coverage of additional concepts, such as the fundamentals of probability, model fitting, effect size estimation, and Bayesian methods, in our curricula for future STEM students. Lastly, the personal feelings toward statistics expressed by the faculty members from other disciplines were illuminating and encouraging. Several of them divulged an animosity or ambivalence toward statistics during their undergraduate educations, but they came to recognize its value during graduate study and in their careers. This reveals an urgent need to provide stronger motivation for STEM students in an undergraduate statistics course. We should strive to show why the statistical methods students are learning are necessary within their respective fields, and to demonstrate the value that statistics provides within science. A final benefit of meeting with representatives from other departments is that it strengthens the connections between the statistics department and its client departments.

4 Discussion

We encourage statisticians in academia to evaluate the coverage of p-values and hypothesis testing in their curricula in light of the ASA p-value statement. Within our statistics department, we conducted an audit of our algebra-based introductory course, using our broad rubric to evaluate how effectively the curriculum conveys the six principles from the ASA p-value statement. We then looked outside the department through a focus group conversation with colleagues across several scientific disciplines. Through these activities, we have formed a better understanding of the coverage of p-values and hypothesis tests within our introductory curriculum and have developed a set of action items to remediate areas where we can improve student learning outcomes. Following the research of Dewey (1933), we feel that critical reflection on the state of the inference curriculum in our introductory service courses will make us more aware of how we approach these concepts in our other courses. We propose the framework that we used in our reflective process as a basic template that can be adapted and implemented at other universities.

Several challenges remain for fully improving the quality of p-value understanding in introductory courses. At many institutions, introductory statistics courses are staffed by temporary instructors or visiting faculty, who may lack experience in the teaching or study of statistics. In addition, lab sections are often facilitated by graduate assistants, many of whom are new to statistics. Continuing research methodology courses in other departments are also often taught by non-statistician instructors who may themselves hold misconceptions about significance and the interpretation of p-values (Harradine, Batanero, and Rossman 2011). This suggests that instructor preparation is a necessary component of improving the teaching of p-value concepts, and that we need a continuing process of training the instructors for our introductory classes. A good set of resources for staying current with instructional practices can be found through the ASA Section on Statistics Education (community.amstat.org/statisticaleducationsection), the Consortium for the Advancement of Undergraduate Statistics Education (causeweb.org), and the Statement on Qualifications for Teaching an Introductory Statistics Course by the ASA/MAA Joint Committee on Undergraduate Statistics (JCUS) (2014).

Another challenge for introductory statistics is that the ASA statement encourages framing p-values within a wider range of methods. We cannot expect students new to statistics to appreciate the value of robust analyses and reporting without some real exposure to those methods. For this, we need to find space in the curriculum—a scarce commodity in any statistics course—to cover additional analysis methods, such as model selection, bootstrapping, generalized linear models, or Bayesian methods. Including this broader context for p-values and/or reinforcing the introductory curriculum will almost certainly come at the cost of another topic, and clearly not all of these topics can be added to an introductory statistics class. While the weight given to any particular topic is subjective, the balance should be struck in light of the general consensus in statistics education that a proper understanding of inference is a foundational learning objective in introductory statistics (Cobb 2007; Garfield and Ben-Zvi 2008; Rossman 2008; Carver et al. 2016). Realistically, we concede that a truly robust understanding of statistical analyses is developed through continued statistical education beyond the introductory course. While this is a given for the future statisticians in our classrooms, our discussions with colleagues from the focus group made it clear that we need to provide consistent motivation, starting in the earliest service courses, for how statistical foundations—such as a correct understanding of p-values—are valuable across all scientific fields. We agree with Harradine, Batanero, and Rossman (2011), who argued that, for students’ understanding of inference, “the underpinning ideas need to be developed over years, not weeks.”
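To illustrate the kind of exposure we have in mind, the following minimal Python sketch shows one such topic, a percentile bootstrap confidence interval for a mean. The data are simulated and the entire example is ours, offered as one possible classroom illustration rather than material from any course discussed above.

    # Percentile bootstrap confidence interval for a mean, one of the
    # broader-context methods named above.  The data are simulated.
    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.exponential(scale=2.0, size=40)  # hypothetical skewed data

    boot_means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(10_000)
    ])
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    print(f"sample mean {sample.mean():.2f}, 95% CI ({lo:.2f}, {hi:.2f})")

An example like this pairs naturally with a p-value discussion, since the interval communicates effect size and uncertainty that a lone p-value cannot.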

As statistics educators, we need to recognize the potential of our service courses to inform the next wave of scientists and scholars about the correct interpretation and fundamental use of p-values. Statisticians should also strive to reinforce the value that statistical analyses bring to general scientific inquiry by actively engaging in conversation with our peers in other scientific professions.

Acknowledgments

The authors thank our collaborators—Drs. Tom Crist, Joe Johnson, Jonathan Levy, Hank Stevens, and Rose Marie Ward—for representing their respective scientific disciplines and providing insights in the curricular focus group discussed above. We also thank the anonymous referees and associate editor for the helpful suggestions they provided on the first version of this article.

References

  • ASA/MAA Joint Committee on Undergraduate Statistics (JCUS) (2014), “Qualifications for Teaching an Introductory Statistics Course,” American Statistical Association and Mathematical Association of America Joint Committee on Undergraduate Statistics, available at amstat.org/asa/files/pdfs/EDU-TeachingIntroStats-Qualifications.pdf.
  • Batanero, C. (2000), “Controversies Around the Role of Statistical Tests in Experimental Research,” Mathematical Thinking and Learning, 2, 75–97. DOI: 10.1207/S15327833MTL0202_4.
  • Berry, D. A. (2016), “P-Values Are Not What They’re Cracked Up to Be,” Online Discussion: ASA Statement on Statistical Significance and P-values, The American Statistician, 70, 1–2.
  • Carver, R., Everson, M., Gabrosek, J., Horton, N., Lock, R., Mocko, M., Rossman, A., Rowell, G.H., Velleman, P., Witmer, J., and Wood, B. (2016), “Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report 2016,” American Statistical Association, available at amstat.org/education/gaise.
  • Cobb, G. W. (2007), “The Introductory Statistics Course: A Ptolemaic Curriculum?” Technology Innovations in Statistics Education, 1, 1–16, available at escholarship.org/uc/item/6hb3k0nz.
  • Cobb, G. W. (2016), “ASA Statement on P-Values: Two Consequences We Can Hope For,” Online Discussion: Official Supplement to ASA Statement on Statistical Significance and P-values, The American Statistician, 70, 1.
  • delMas, R., Garfield, J., Ooms, A., and Chance, B. (2007), “Assessing Students’ Conceptual Understanding After a First Course in Statistics,” Statistics Education Research Journal, 6, 28–58.
  • Dewey, J. (1933), How We Think: A Restatement of the Relation of Reflective Thinking to the Educative Process, New York: D.C. Heath.
  • Falk, R., and Greenbaum, C. W. (1995), “Significance Tests Die Hard: The Amazing Persistence of a Probabilistic Misconception,” Theory & Psychology, 5, 75–98. DOI: 10.1177/0959354395051004.
  • Garfield, J., and Ben-Zvi, D. (2008), Developing Students’ Statistical Reasoning: Connecting Research and Teaching Practice, Berlin: Springer Science & Business Media.
  • Goodman, S. N. (2016), “The Next Questions: Who, What, When, Where, and Why?” Online Discussion: Official Supplement to ASA Statement on Statistical Significance and P-values, The American Statistician, 70, 1–2.
  • Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016), “Statistical Tests, P-values, Confidence Intervals, and Power: A Guide to Misinterpretations,” European Journal of Epidemiology, 31, 337–350. DOI: 10.1007/s10654-016-0149-3.
  • Harradine, A., Batanero, C., and Rossman, A. (2011), “Students’ and Teachers’ Knowledge of Sampling and Inference,” in Teaching Statistics in School Mathematics—Challenges for Teaching and Teacher Education. A Joint ICMI/IASE Study: The 18th ICMI Study, New York: Springer.
  • Ioannidis, J. P. (2005), “Why Most Published Research Findings are False,” PLoS Medicine, 2, e124. DOI: 10.1371/journal.pmed.0020124.
  • Millar, M. (2016), “ASA Statement on P-values: Some Implications for Education,” Online Discussion: Official Supplement to ASA Statement on Statistical Significance and P-values, The American Statistician, 70, 1.
  • Nickerson, R. S. (2004), Cognition and Chance: The Psychology of Probabilistic Reasoning, New York, NY: Psychology Press.
  • Open Science Collaboration (2015), “Estimating the Reproducibility of Psychological Science,” Science, 349, aac4716. DOI: 10.1126/science.aac4716.
  • Rossman, A. J. (2008), “Reasoning about Informal Statistical Inference: One Statistician’s View,” Statistics Education Research Journal, 7, 5–19.
  • Stangl, D. (2016), “Comment,” Online Discussion: Official Supplement to ASA Statement on Statistical Significance and P-values, The American Statistician, 70, 1.
  • Wasserstein, R. L., and Lazar, N. A. (2016), “ASA Statement on P-values: Context, Process, and Purpose,” The American Statistician, 70, 129–133. DOI: 10.1080/00031305.2016.1154108.
  • Weir, K. (2015), “A Reproducibility Crisis?” Monitor on Psychology, 46, 39.
  • Williams, A. M. (1999), “Novice Students’ Conceptual Knowledge of Statistical Hypothesis Testing,” in Making the Difference: Proceedings of the Twenty-Second Annual Conference of the Mathematics Education Research Group of Australasia, 554–560.