Editorial

The Growing Importance of Reproducibility and Responsible Workflow in the Data Science and Statistics Curriculum

Pages 207-208 | Published online: 18 Nov 2022

Modern statistics and data science use an iterative data analysis process to solve problems and extract meaning from data in a reproducible manner. Models such as the PPDAC (Problem, Plan, Data, Analysis, Conclusion) Cycle (n.d.) have been widely adopted in many secondary and post-secondary classrooms (see the review by Lee et al. 2022). The importance of the data analysis cycle has also been described and reinforced in guidelines for statistics majors (ASA Curriculum Guidelines 2014), undergraduate data science curricula (ACM 2021), and in data science courses and teaching materials (e.g., Wickham and Grolemund 2022).

In 2018, the National Academies of Sciences, Engineering, and Medicine’s “Data Science for Undergraduates” consensus study (NASEM 2018) broadened the definition of the data analysis cycle by identifying workflow and reproducibility as components of the data acumen needed in our graduates. The report noted that “documenting, incrementally improving, sharing, and generalizing such workflows are an important part of data science practice owing to the team nature of data science and broader significance of scientific reproducibility and replicability.” The report also tied issues of reproducibility and workflow to the ethical conduct of science.

The importance of others being able to have confidence in our findings is built into the foundations of statistics and data science (Parashar, Heroux, and Stodden 2022). For instance, in theoretical research, theorems are introduced along with their proofs. As statistics has come to rely more on computational methods, innovation is needed to ensure that the same level of rigor characterizes claims based on data and code. Efforts to foster reproducibility in science (NASEM 2019; Parashar, Heroux, and Stodden 2022) and to accelerate scientific discoveries (NASEM 2021) have highlighted the importance of reproducibility and workflow within the broader scientific process.

Robust workflows matter. For instance, COVID-19 counts in the United Kingdom were underestimated because the Excel file format used in the reporting pipeline silently dropped records that exceeded its row limit (Kelion 2020). The economists Carmen Reinhart and Kenneth Rogoff made Excel errors that resulted in miscalculated GDP growth rates (Herndon, Ash, and Pollin 2014). Cut-and-paste errors are all too common in many workflows (Perkel 2022). The reproducibility crisis that was first identified in psychology is now known to afflict much of the physical and social sciences. Steps taken to address this crisis, including improved reporting of methods, sharing of code and data, and version control, are increasingly common (Munafò et al. 2017), but workflow and reproducibility issues can be subtle.
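To make the point concrete, consider one small scripted safeguard of the sort these failures invite. The sketch below is ours, not drawn from any of the cited papers; the file name and expected row count are hypothetical. The idea is simply that a scripted pipeline can fail loudly where a manual spreadsheet step fails silently.

import csv

EXPECTED_MIN_ROWS = 100_000  # hypothetical lower bound known from the data provider

# Load the data in code rather than by hand, so the step is repeatable.
with open("case_counts.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Fail loudly instead of silently analyzing a truncated file,
# the failure mode behind the UK COVID-19 undercount.
assert len(rows) >= EXPECTED_MIN_ROWS, (
    f"expected at least {EXPECTED_MIN_ROWS} rows, got {len(rows)}; "
    "possible truncation during export"
)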

Where does this leave statistics and data science education? Many instructors incorporate their research into their teaching, a practice known as research-led teaching (Schapper and Mayson 2010). It is often overlooked, however, that most of the time spent on research involving data science and statistics goes to tidying data, documenting data provenance, improving group collaboration and sharing, anonymizing data, and creating analytic datasets as well as repository and replication files. While it is widely advocated that open science practices should be embedded in everyday research (Sandve et al. 2013), it is less clear why, on the pedagogy side, we should give most of the classroom attention to the middle part of the project (data analysis) and fail to incorporate the entire data analysis cycle (De Veaux et al. 2017; Wickham and Grolemund 2022). Future scientists need multiple opportunities to undertake the entire data analysis cycle with real data and appropriate workflows. Teaching about reproducibility and workflow aligns well with the research-led approach.

However, many barriers make it challenging to incorporate sound workflow and reproducible analysis into our courses and programs. These include the rapid and constant evolution of technology and tools, the minimal training most instructors received in reproducible methods, the lack of well-established best practices, the paucity of vetted and inclusive curricular materials, and our limited knowledge of what students understand (or misunderstand) when we teach about the data analysis cycle, workflows, and reproducibility.

To highlight work in this important and developing area, the Journal of Statistics and Data Science Education invited papers related to “Teaching reproducibility and responsible workflow.” The November 2022 issue of the journal is devoted to this important topic. We are excited by what the community brought forward in these 11 papers. It is our hope that the collected papers in this issue and related resources provide motivation, guidance, and examples that complement prior published work (see, e.g., Çetinkaya-Rundel and Ellison 2021; Smith, Yu, and Schmid 2021).

Integrating reproducibility into our practice and our teaching can seem intimidating at first (Monajemi et al. 2019). One way forward is to start small: make one small change that exposes students to reproducibility in one class, then make another the next semester. Our students can gain much of the benefit of reproducible and responsible workflows even if we make only a few small changes in our teaching, and these efforts will help them draw more trustworthy insights from data. If it leads, by way of some virtuous cycle, to us improving our own practice, then even better! Improving our teaching by providing curricular guidance about reproducible science will take time and effort that should pay off in the long term.
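One such small change, sketched below purely as an illustration (the seed value and script structure are our own invention, not a prescription), is to have every analysis script fix its random seed and report the environment it ran under, so that students and instructors can re-create one another's results.

import platform
import random
import sys

SEED = 20221118  # fixed seed so simulated results are repeatable across runs

random.seed(SEED)
print(f"Python {sys.version.split()[0]} on {platform.platform()}, seed={SEED}")

# The analysis itself follows; anyone re-running the script sees the same draws.
sample = [random.gauss(0, 1) for _ in range(5)]
print(sample)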

We look forward to seeing your innovations, and welcome future submissions on these important issues.

Beyond these papers, the interactions between the reviewers, associate editors, and authors over the past year and a half have led us to review the journal’s policies and expectations for data and code sharing. As a result of these discussions and deliberations, the Journal of Statistics and Data Science Education has joined other American Statistical Association journals (e.g., JASA 2022) in requiring submission of the code, data, and workflow needed to reproduce a paper. New submissions to the journal now require a data availability statement that indicates how the data and code associated with a paper have been made accessible. These resources have the potential to help both the review process and readers of the paper, who can reuse or adapt them for other courses. We believe that these policies, which are consistent with broader guidance about data and code sharing (e.g., Nosek et al. 2016), will benefit the entire community.

Acknowledgments

We thank Micaela Parker for her work as one of the guest editors for the issue as well as the many reviewers who provided helpful feedback as part of the editorial review.

References