Diagnosing Data Analytic Problems in the Classroom

Pages 267-276 | Published online: 11 Oct 2021

Abstract

Teaching data analysis by providing students with real-world problems and datasets allows students to integrate a variety of skills in a situation that mirrors how data analysis actually works. However, whole data analyses may obscure the individual skills of data analytic practice that are generalizable across data analyses. One such skill is the ability to diagnose the cause of unexpected results in a data analysis. While experienced analysts can quickly iterate through a series of potential explanations when confronted with unexpected results, novice analysts often struggle to figure out how to move forward. The goal of this article is to describe an approach to teaching students skills in diagnosing data analytic problems. The exercise described here is targeted to allow students to practice this skill and to assess the depth of their knowledge about the statistical tools they have learned. We take a hypothetical case study approach and focus on the students’ reasoning through their diagnoses and suggestions for follow-up action. We found the implementation of this exercise in a small graduate course to provide valuable information about the students’ diagnostic thought processes, but further work is needed regarding structured approaches to implementation and the design of assessments. Supplementary materials for this article are available online.

1 Introduction

The teaching of statistics and data analysis has gone through a massive transformation over the past 20 years. The American Statistical Association’s Curriculum Guidelines for Undergraduate Programs in Statistical Science (ASA Undergraduate Guidelines Workgroup 2014) and the Guidelines for Assessment and Instruction in Statistics Education (Carver et al. 2016) both emphasize the need for real problems, real datasets, and exposure to the entire investigative cycle of data analysis. Numerous other recommendations have emphasized computing and the idea of “thinking with data” (Nolan and Lang 2010; Hardin et al. 2015). A key theme that has emerged is the need to expose students to real data and their associated challenges (Donoghue, Voytek, and Ellis 2021; Nolan and Temple Lang 2015). We have taken this approach in our own courses and have attempted to give students as many opportunities as possible to engage in the process of analyzing a whole dataset. In general, the field of statistics has moved away from thinking about statistics education as an assortment of tools and methods toward a more integrative approach that focuses on the scientific method and its relation to statistical analysis (ASA Undergraduate Guidelines Workgroup 2014).

Much effort has been given to the study of data analysis and to decomposing this complex task into identifiable components. Techniques such as cognitive task analysis (Clark et al. 2007) have been used to identify key stages in a data analysis in order to formally characterize the statistical reasoning process. For example, Lovett identified seven key steps in a cyclical process of exploratory data analysis, including “translate question into statistical terms,” “identify relevant variables,” etc. (Lovett 2001). Detailed examinations of complex tasks have been leveraged to redesign introductory courses in statistics and data analysis (Lovett and Greenhouse 2000; Greenhouse and Seltman 2018) and other fields (e.g., Ríos et al. 2019). In related work, Grolemund and Wickham propose a broad theory of data analysis that similarly identifies several key stages in an iterative process of “sensemaking” where observed data update mental models for describing the world (Grolemund and Wickham 2014).

As an approach to teaching data analysis, giving students an open-ended question and dataset is not without its drawbacks. From the start, students must be sufficiently informed about the scientific background in order to make any sense of the dataset handed to them. The instructor must be sure to present a problem that is relevant while also being solvable with available data. From the instructor’s perspective, datasets must be identified that satisfy a number of specific criteria that make them useful for teaching purposes. In training situations where the goal is to reinforce specific skills that can be mapped to previously identified stages of the data analytic process, it can be useful to have an activity or exercise that is targeted at that skill. A targeted exercise can be repeated efficiently and allow students to practice that skill and get feedback.

The aim of this article is to build on the previous work decomposing the data analytic process into key tasks and to focus specifically on one such task: the interpretation of statistical output and the diagnosis of unexpected findings or anomalies. Grolemund and Wickham note that observations that do not match our mental models for the world are important for building information and updating our mental models. However, a key to deciding whether one should update their mental model or reject the observed data is developing an explanation for why an anomaly might have occurred. Properly observed data may invalidate an existing mental model; however, failures in the data collection process can result in flawed data, which may be an indicator of other problems. The data scientist must be able to distinguish between these different scenarios in order to make an appropriate decision on next steps. Diagnosing the cause of unexpected results is an important skill in any iterative data analysis and is a key aspect of becoming a critical consumer of statistical results in a wide variety of settings (Carver et al. 2016).

The overall goal of this article is to (i) identify a general data analytic skill that we felt was valuable for developing proficiency in data analysis; (ii) present didactic material to describe a formal methodology for applying this skill to data analysis problems; and (iii) design a focused assignment to directly assess this skill and to provide students an opportunity to discuss their reasoning. In this article, we describe an exercise developed for the purpose of giving students the opportunity to practice their data analytic diagnosis skills.

Our intention was to develop material that would be valuable to an instructor interested in specifically targeting students’ diagnosis skills. We consider the primary contribution of this paper to be the design and format of the exercise (Section 2) and the specific case studies themselves (Appendices B and C.6). We implemented this exercise in a small graduate course and collected some valuable information, but evidence was limited by the lack of any formal assessment.

1.1 Task Identification: Diagnosing Unexpected Outcomes

One problem that we as instructors in data analysis have repeatedly encountered is the limited ability of novice data analysts (i.e., students) to diagnose the root cause of an unexpected outcome in their analysis. Upon seeing an unexpected result, experienced analysts and data scientists, who have typically seen a wide variety of anomalies before and are knowledgeable about the statistical tools they are using, can often quickly come up with a list of potential explanations as part of the natural flow of analysis. These explanations are then explored further and a revised analysis may be proposed as a result. This iterative cycle of analysis is embedded within the model proposed by Grolemund and Wickham (2014) and, while easily executed by experienced analysts, can be a stumbling block for novices attempting to produce a useful analysis.

In order to make decisions about what to do next in an analysis and to move the analysis forward, a data analyst has to be able to map out the decision space based on the observed output from the analysis. Describing the root causes of any unexpected outcomes is a key part of constructing this decision space. We felt that as part of data analysis training, the exercise of this reasoning process is at least as important as the ability to assemble statistical methods and apply them to a dataset. Therefore, we chose to focus on “diagnosing the causes of unexpected outcomes” and to design materials around this skill.

The exercise described in Section 2 builds on the iteration described by Grolemund and Wickham (2014) and exposes one part of the iteration, the part where an analyst decides whether and how to update their schema for explaining the world. The aim was to give students an opportunity to (i) interpret statistical output within a specific (albeit hypothetical) context, (ii) diagnose possible causes of the discrepancy between the output and the stated expectation, and (iii) specify next steps to resolve the discrepancy. In a real data analysis, the student would then execute the next steps on the data and repeat the process in an iterative fashion. The aim here was to dissect this iterative process in order to allow students to practice going from Step 1 to Step 2 and then from Step 2 to Step 3.

2 Methods

We first wrote didactic material describing a systematic approach for diagnosing unexpected outcomes in data analysis. Briefly, we described a five-part approach that the students should follow:

  1. State the unexpected outcome in terms of what output from the analysis is being considered, how it deviates from our expectations, and when it is observed to occur.

  2. Reconstruct the entire sequence of events (or as much as possible based on available information) that occurs leading up to the unexpected outcome. This can be a sequence of code statements or a more abstract system diagram. The reconstruction can be detailed or not depending on how much information is available. With a reproducible analysis, it should be possible to reconstruct all of the steps.

  3. Starting with the output, trace back through the system diagram or sequence of code statements and enumerate any possible sources of error and their likelihood of occurring. This process may branch off in different directions at each stage of the system diagram/code.

  4. Stop the process once we either reach an explanation that lies outside the system or we have identified a set of root causes that is not worth developing any further.

  5. Summarize the root causes.

Further details about the lecture material can be found in Appendix A. Due to the asynchronous nature of the course in which this exercise was included, the lecture material was delivered solely as a reading.

2.1 Problem Formulation

A challenge presented here is that the notion of an “unexpected outcome” depends on two aspects: (i) the student’s expectation for what the outcome of an analysis should be; and (ii) the observed outcome of the analysis. In order to directly assess a student’s ability to diagnose an unexpected outcome, we would need to design materials where these two aspects were carefully controlled. While datasets can be found that (likely) present unexpected findings to students if they take typical approaches, the desired outcome of an unexpected result would not be guaranteed because it would depend on the student’s chosen approach.

Rather than design a whole data analysis that may or may not lead to an unexpected outcome, we decided to take the approach of designing hypothetical case studies with no datasets. We found that each of us as instructors could describe numerous situations where we encountered an unexpected outcome in our own data analyses. Therefore, we chose simply to describe the unexpected result and the circumstances that led to it. That way, we could control both the nature of the student’s expectations and the “observed” result simultaneously.

We implemented three case studies for this exercise, each with the same structure. The details of the case studies become more complicated as they progress from Case Study 1 through Case Study 3. Each case study poses a hypothetical data analytic situation with the following structure:

  • a brief introduction and description of the data analysis problem along with some rationale for the statistical methods being applied to the data;

  • a natural language description of the data analytic system being applied to the data;

  • a description of the (hypothetical) expected outcomes from the system; and

  • a description of the (hypothetical) observed outcome from the system.

There are no datasets attached to the case studies and the students had to make their assessments based on the written descriptions. The three case studies used for this exercise are shown in Appendices B and C.6.

The case studies are not designed to have a “correct” answer. Rather, the goal here is to give a student the opportunity to reason through a problem and to describe all of the possible causes of an unexpected result. A second goal is to expose students to diverse data analytic perspectives, which was made possible by the group discussions (Section 3.4). In particular, we hope to expose students to “what they didn’t realize they didn’t know” about possible causes of unexpected data analytic results.

2.2 Implementation

We co-teach a course titled “Advanced Data Science,” which is a required semester-long course for master’s and PhD students in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. This course is also open to quantitatively oriented students outside of the department and is regularly taken by PhD and master’s students in the Johns Hopkins School of Public Health and School of Medicine. The prerequisites for this course include a year of biostatistical theory and methods and intermediate knowledge of programming in R (programming in R is not taught in the course). As such, the course is usually taken by students in the second year of their graduate training. The course enrolls 20–30 students in a typical year.

The current overall design of the course focuses on teaching data science methods in the first half of the semester. Methods include tools for data wrangling, high-dimensional data visualization, exploratory data analysis, and machine learning. Throughout the course, the students are given assignments where they must analyze real datasets using the tools learned in the course. Topics such as communication, presentation techniques, and reproducible research are also covered.

The exercise described in this article was presented toward the end of the course (week 13 out of 16). The reason for placing this content later in the course was that for this exercise, students would be expected to integrate their knowledge of the behavior of various statistical methods and software packages in order to identify the potential cause of unexpected outcomes. While some of the students came to the course already experienced in data analysis, for others this course provided their first opportunity to analyze real data. Therefore, we felt it would be more productive to present the exercise after the students had all gained more experience operating various statistical tools and doing some data analyses.

The students were given the lecture/reading, the three case studies, and one week to write up their diagnoses. For each case study, the students were asked to diagnose the root causes of the unexpected outcome when applying the hypothetical data analytic system to the data, based on their understanding of the system description and their knowledge of how the underlying system components operate. They were also asked to specify the following recommendations:

  1. Describe a single plot or summary you might produce next in order to provide further information on the nature of the unexpected outcome;

  2. Describe what you would do next (if anything) or what you might recommend be modified about the data analytic system in response to your diagnosis.

The students were given an R Markdown template of each of the case studies and asked to complete the homework by filling in the blank areas of the R Markdown file. The format of the R Markdown file is shown in Appendices B–C.6. Collaboration was not discouraged, but each student was asked to produce their own write-up.

In addition to writing up their diagnoses, the students were also asked to prepare to discuss their answers in a small group moderated by an instructor or teaching assistant. We divided the students into groups of 3–4 and had them discuss their reasoning for how they arrived at the root causes and the follow-up actions.

For this initial pilot of the exercise we did not implement a specific mechanism for encouraging the students to do the reading (such as a brief quiz). As this was a small graduate-level course with which we had significant experience, we had some confidence in the students’ engagement with the course generally and with the reading specifically. However, in future iterations of this exercise we may take a different approach; we discuss this further in Section 4.

This study and the summary of the classroom exercise were approved by the Institutional Review Board of the Johns Hopkins Bloomberg School of Public Health (IRB #15332).

2.3 Prerequisite Knowledge

The cases presented in this implementation of the exercise were designed for a second-year graduate class and presumed knowledge of generalized linear models and a level of R programming knowledge consistent with having completed an introductory R course. The primary aim of the exercise was to probe the students’ knowledge of statistical methods and their respective R software implementations. That said, there is nothing about the exercise that is specific to second-year graduate students. Some knowledge of data analysis techniques is also needed in order for students to be able to integrate their diagnoses across the data, methods, and software. In the current implementation of this exercise, data analysis was covered in the first half of the overall course.

We feel the procedure outlined in Section 2 could be presented in an undergraduate- or introductory-level course and that a version of this exercise could be adapted for a data analysis course at any level. The case studies would need to be altered to reflect the methodology and software that the students had learned in their previous training. For example, introductory courses at our institution primarily use Stata software (StataCorp 2021) and we believe it would be possible to develop a case study that would probe students’ knowledge of statistical methods using that software package.

2.4 Primary Outcomes and Evaluation

Our primary outcome was the set of possible root causes identified by the student to explain the unexpected outcome from the hypothetical analysis. A secondary outcome was the student’s recommendation for next steps, such as a plot or summary statistic to calculate, or any modifications to the analytic system based on their diagnosis. Here, we wanted to see that the students could map their root cause diagnosis to a hypothesis to test using a specific statistical technique. For example, if a student identified outliers as a possible cause of an unexpected outcome, then we would expect that the subsequent recommendation would be a plot or statistic that could either be consistent with or reject that explanation.
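
For instance, a minimal sketch (our own illustration, using hypothetical simulated values rather than any course data) of the kind of follow-up check we had in mind for an outlier explanation:

x <- c(rlnorm(23, meanlog = 3), 250)  # hypothetical hourly values with one extreme value
summary(x)                            # does the maximum look plausible for this measurement?
boxplot(x)                            # quick visual check for extreme outliers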

The assessment of this exercise was based on the students’ write-ups of their solutions. In their write-ups, we expected students to identify at least one root cause for each case study and to provide at least one follow-up action corresponding to that root cause. We also instructed students to describe the reasoning that led to their proposed root causes and the rationale for their proposed follow-up action. Group discussions were not assessed and were primarily used as a way for students to hear from their peers about how they reasoned about the case studies.

Given the hypothetical nature of the case studies, it was not possible to determine if the solutions were “correct,” but we did check to see if the proposed solutions were plausible within the description of the case study. In this exercise, we did not feel it was a priority for students to achieve some definition of correctness because ultimately, we were assessing the students’ ability to move forward and specify the next step in a data analysis. In real data analyses, data analysts generally have to iterate a few times to determine if their proposed root causes are correct and different analysts will move at different paces.

3 Results

In their write-ups of their diagnoses for each case study, students generally presented one of two approaches. The first approach applied the methodology described in the reading and worked backward from the unexpected outcome, attempting to identify problems along the way. The second approach started more from “first principles” and tried to identify more general problems that might arise. The first approach appeared to produce a narrower focus on the problem and a smaller selection of root causes. The more general approach often led to a wider variety of root causes, some of which were only described vaguely. We summarize the students’ responses in the sections below.

3.1 Case 1

Case 1 describes computing the mean of a vector of values (generated from an air pollution monitor) after log-transforming the raw data. The unexpected outcome is that the mean function returns NaN instead of a numerical value. This case required an understanding of how the mean() function in R works and the difference between NaN and NA values.
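
As a brief illustration (our own, not part of the case study materials) of the R behavior at issue, note how NaN and NA propagate differently through log() and mean():

x <- c(10, 25, -3, 40)        # one negative value, e.g., a miscoded measurement
z <- log(x)                   # log(-3) is NaN and R warns "NaNs produced"
mean(z)                       # NaN: a NaN in the input propagates through mean()
mean(log(c(10, 25, NA, 40)))  # NA, not NaN: missing values propagate as NA instead
is.nan(NA); is.na(NaN)        # FALSE; TRUE -- NaN is treated as a special kind of NA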

We can connect this case study to the five-part framework presented in Section 2 in the following manner:

  1. In all of the cases, the unexpected outcome, or “anomaly,” was stated in the description. In this case the anomaly is that the value of z¯ is shown to be “NaN” when printed to the console, when our expectation is that the output will be a single numerical value falling in the interval (−∞, ∞).

  2. The reconstruction of the system that led to the unexpected outcome is also given to the students in the “System” description. In this case it is given as a linear sequence of R code expressions.

  3. Starting with the output value of NaN, the student must then trace back through the steps stated in the system and determine possible causes. The aim would be to start with the print function to determine if anything unexpected might have occurred there. Then we could work backward to the mean function to determine what might cause a NaN value to be output and whether the function itself was the cause or the input to the function was the cause. From there, we can work backward to the log transform and then the process of reading in the data themselves.

  4. The diagnosis would naturally stop at the beginning of the system, which in this case was at the point of reading in the data. Problems could have originated before reading in the data, but there are no details here describing what happened previously.

  5. The summary of the possible causes would identify specific conditions that could be checked. In this case, there might be NaN values in the raw data vector or there might be negative numbers in the raw data which become NaN after the log transformation.
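
To make Step 5 concrete, the summarized conditions could be checked directly in R along the following lines (a sketch under the assumption that the raw file named in the case study has been read into dat):

dat <- read.csv("dataset.csv")    # hypothetical file from the case study
sum(is.nan(dat$x))                # are NaN values already present in the raw data?
sum(dat$x < 0, na.rm = TRUE)      # negative values that log() would map to NaN?
summary(dat$x)                    # any other surprises (NAs, -99 missing codes, zeros)?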

Most students identified possible problems with the raw data as a root cause. In particular, most realized that the most likely way a NaN could appear in the output is if the data contained negative values and the log-transform resulted in NaN. For example, one student wrote in the beginning of the diagnosis,

Working backward, it seems unlikely that the print function would misprint ‘z’, so the cause is more likely to be in steps 1-3 of the system. Likewise, the mean function is unlikely to cause an issue. Therefore, the input of the mean function must have not been numeric. This can happen for several reasons. If there are missing values that were read into R as ‘NA’ values, then the mean function returns ‘NA’. Similarly, ‘NAN’ is returned if there is an ‘NAN’ in the input.

Another possibility mentioned by a few students was that there was a NaN value in the raw data that was passed through by the log function to the mean() function. Some students noted that missing values or outliers are sometimes coded as -99 and so the log transform of such a value would result in NaN.

A few students raised the possibility of problems originating outside the analysis itself. For example, one student wrote,

Given the nature of the variable, I would not expect negative entries for ‘x’ unless there is some systematic error. Since ‘x’ here is our raw data, I would ask the collaborators if there is are any potential sources of a systematic error (e.g., calibration error in data collection) that would result in negative data entries.

Not all students demonstrated a complete understanding of the operating characteristics of the mean() or log() functions in R. Many thought that there might be a problem reading in the data as character instead of numeric and that the mean of a character vector would result in NaN (the mean() function in fact produces NA in this case). Others thought that there might be zeros in the dataset, which might lead to the log() function producing NaN (the log() function produces -Inf in this case, for which the mean will be calculated as -Inf). One student was not clear on the behavior of the mean() function and so developed a number of test cases to determine what kinds of input would generate a NaN output.
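
Small test cases of this sort (ours shown below, similar in spirit to what that student did) quickly confirm the actual behavior:

mean(c("1", "2", "3"))     # NA with a warning, not NaN
log(0)                     # -Inf, not NaN
mean(log(c(0, 10, 25)))    # -Inf: the -Inf term dominates the mean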

Many students came up with general problems that might cause the observed unexpected outcome. Such diagnoses included the air pollution monitor having a faulty recording system, text entries in the dataset (instead of numbers), NA values in the dataset, and human error in applying the mean() function. While each of these diagnoses appears plausible, the students’ descriptions generally were not specific enough to explain how they would lead to an NaN value upon computing the mean.

3.2 Case 2

Case 2 describes fitting a simple linear regression model to outcome y and predictor x after removing any observations that contain NA values in either x or y. The unexpected outcome is that the estimated slope β̂x = −3 when we are expecting β̂x ∈ [0, 2]. In addition to understanding the basics of linear regression, this case requires students to consider the impact of removing missing values on the estimation of βx. No explicit R code was given for this case.

As in Case 1, the unexpected outcome is deliberately given to the student in the description of the problem. For the description of the sequence of steps leading to the unexpected outcome, instead of R code describing the analytic system, a natural language sequence of steps was given to describe the analysis. For the diagnosis, we again expected the students to start with the output (i.e., β̂x = −3) and work backward through the steps presented in the description. The primary suspicion lies in Step 2, where missing observations are removed from the dataset and the pattern of missingness is not specified. Therefore, one possible diagnosis could have been that the removal of missing data created a pattern in the dataset that had not previously been seen. While there may be other (valid) causes of the unexpected outcome, they are not as explicitly presented in the description of the analysis.

Most students noted that the unexpected outcome could be a result of incorrect expectations for β̂x, that is, the simple linear regression model is wrong. Related to this was the suggestion by some students that high variability in the data (possibly due to the reduced sample size from removing missing data) could cause a negative value of β̂x to be in the range of outcomes. In both cases, the negative value for β̂x should have been expected. Violations of the linear model assumptions also fell into this category of explanations.

The removal of missing values was also flagged as a potential problem, although for a variety of reasons. One student wrote:

A possible cause of the anomaly is that the encoding of missing values that are expected from the data collection process is not compatible with our current system. For example, if missing values are encoded as ‘“” ’ or ‘-Inf’, and not ‘NA’ as we initially expected, then the observations with missing data will not be filtered out by our current system.

Some students noted that if the missing data were not missing completely at random, then their removal could alter the estimation of the regression coefficient. Others noted that the removal of missing observations reduced the available sample size and therefore might have increased the variability in β̂x.
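
A contrived simulation (our own illustration, not part of the case study) shows how informative missingness alone can push β̂x outside the expected range and even flip its sign:

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 1 * x + rnorm(n, sd = 5)               # true slope is 1, within the expected [0, 2]
coef(lm(y ~ x))["x"]                        # full data: estimate near 1

y[(x - mean(x)) * (y - mean(y)) > 0] <- NA  # concordant observations tend to go missing
dat <- data.frame(x = x, y = y)
dat <- dat[complete.cases(dat), ]           # Step 2 of the system: drop rows with NA
coef(lm(y ~ x, data = dat))["x"]            # estimate is now negative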

Students suggested a number of “human error” causes such as sorting data by a single column in Microsoft Excel without sorting corresponding columns, regressing x on y instead of y on x, using the wrong link function in the regression, or extracting the intercept coefficient from the model instead of the slope. A range of data problems were also suggested, such as outliers, contamination of the raw data, or general “errors in the data.”

3.3 Case 3

Case 3 describes fitting a Poisson regression model to outcome y with predictors x and z where the focus is on the coefficient for predictor x, denoted as βx. A Normal approximation is used to construct a 95% confidence interval for βx and the unexpected outcome is that the length of the interval is much larger than one would expect. This case requires students to consider the factors that might lead to a very large confidence interval in a multivariate Poisson regression.

Once again, the unexpected outcome is presented explicitly in the description of the problem. This case gives R code to describe the analysis steps, along with a natural language description. The R code rules out some sources of error, such as extracting the wrong coefficient from the model fit. Because this case has to do with a larger than expected confidence interval, the construction of the confidence interval in Step 5 is arguably the first place to examine. The Normal approximation requires the standard error for β̂x so Step 4 is the next place to look. This case centers around the possible reasons that the standard error might be much larger than expected. Collinearity with z is a key cause, but there could be others such as small sample size.

Given that the estimate β̂x was in the expected range, most students focused their attention on the standard error of β̂x as the potential source of the unexpected outcome. Factors that could cause a standard error to be much larger than expected included outliers in the dataset, a non-Poisson mean–variance relationship where the variance is much larger than the mean, a much smaller sample size than usual (violating a requirement for the Normal approximation), and collinearity between x and z.
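
As a hedged illustration (simulated data, not drawn from the case study), near-collinearity between x and z by itself can inflate se(β̂x) and hence the width of the interval:

set.seed(2)
n <- 100
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.05)                # x is nearly a copy of z
y <- rpois(n, lambda = exp(0.5 + 0.2 * x + 0.2 * z))

sqrt(diag(vcov(glm(y ~ x + z, family = poisson))))["x"]  # large standard error for x
sqrt(diag(vcov(glm(y ~ x, family = poisson))))["x"]      # much smaller without z in the model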

Potential causes that were common to the other two cases included outliers in the data, incorrect model specification, incorrect inclusion of predictors, incorrect coding of variables, and other forms of human error. Although removal of missing data was not an explicit part of the analysis, some students noted that the glm() function in R by default removes missing observations. Therefore, if there were a larger than usual number of missing values, the model would be fit to a much smaller dataset, thereby increasing the standard error.
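
A quick check (our own sketch with simulated data) of this default behavior:

set.seed(3)
dat <- data.frame(x = rnorm(50), z = rnorm(50))
dat$y <- rpois(50, exp(0.1 + 0.2 * dat$x + 0.2 * dat$z))
dat$y[1:30] <- NA                           # an unusually large number of missing outcomes

fit <- glm(y ~ x + z, data = dat, family = poisson)
nobs(fit)                                   # 20: rows with NA are silently dropped
sqrt(diag(vcov(fit)))["x"]                  # standard error now based on only 20 rows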

3.4 Group Discussions

For the group discussions, our expectation was that students with more experience with data analysis would generate a wider variety of root causes than those with less experience. We found that the students enrolled in degree programs outside of the Department of Biostatistics were generally further along in their student careers and had some experience analyzing real data, whereas the students in Biostatistics were all in their second year and had mostly just finished coursework.

Across the cases, most of the students focused on issues with the data or statistical modeling. However, some people with more experience focused on coding or “human error” problems. For example, one student pointed out in Case Study 2 that it was possible that the regression model was fit to the wrong data. In other words, perhaps the variable labeled x is not what we think it is because of mislabeling or a problem of that nature. Other students tended to focus on more coding-related errors, such as subsetting the wrong column of a data frame (Case Study 1) or, as in the case of Case Study 2, indexing the wrong coefficient in the coefficient vector. Common statistical root causes cited in Case Study 3 (unexpectedly large confidence interval) were that the sample size was too small or that the Normal approximation was inappropriate.

The group discussions were particularly useful to assess the depth of the students’ understanding of the answers they provided. Some students were able to narrow down the specific situations that might give rise to the unexpected outcome whereas some students were only able to provide general descriptions. In Case Study 3, many students indicated that as a follow-up action they would “plot the data.” However, when subsequently asked what type of plot they would make in order to shed light on the cause of the unexpectedly large confidence interval, not all could immediately answer. Nevertheless, after some further discussion, most students were able to articulate the type of plot they would make and what pattern they would expect to see that would be consistent with a large confidence interval. For example, one student suggested making a histogram of the response variable in order to look for large outliers.

We found in the group discussions that for each of the case studies, there was never perfect overlap between the sets of root causes identified by the students. Each student was able to come up with a unique root cause for each case study that the other students in the group had not thought of. Based on the students’ discussion and our knowledge of the students’ backgrounds, the diversity of the root causes identified appeared to reflect the diversity of experience among the students.

3.5 Analyst vs. Manager Perspective

In writing up their responses, most students took the perspective that they were the analysts doing the analysis and wrote up their responses accordingly. However, during the class discussion we asked the students to take a different perspective, one where they were managing another person who was doing the analysis. From this perspective, they would not have direct knowledge of the underlying details of the analysis. They would only know what had been presented to them in the case study.

One issue that such a hypothetical arrangement raises is trust. When an analyst evaluates their own work, there is generally little uncertainty about what was done, at least in the short term. However, in considering the work of another analyst, there may be varying levels of a priori trust based on past experience working with others, resulting in more or less uncertainty about what was actually done in the analysis. Several students noted in discussion that if this were the first time the analyst had produced something, they would ask for more follow up than if there had been a long-standing relationship.

4 Discussion

We have designed a pedagogical exercise that allows students to learn and practice an essential data analytic skill without the need for conducting a complete data analysis with code or data. The exercise involves describing a hypothetical data analytic scenario, the statistical system applied to the data, the expected outcome, and how the data analysis output deviated from that expected outcome. The student is then tasked with explaining what might have caused the output to deviate from the expected outcome. The exercise works well in small group settings because we have found that students are eager to discuss their own experiences with data analysis when diagnosing problems.

Diagnosing unexpected outcomes in data analysis is challenging because it requires students to consider two separate systems of reasoning: The system that leads to a data analytic result and the system that leads to one’s expectations about the result. The first is the sequence of steps that transforms the data into an output or result. Students typically will be familiar with this sequence as they most likely learned those steps in class. However, the second system is a sequence of decisions and ideas that leads to one’s expectations about a given data analysis. This system typically involves reading the literature, synthesizing existing evidence, or perhaps doing preliminary studies in order to develop an understanding of how the data were collected and what to expect in an analysis.

For this exercise, we instructed students not to focus on how expectations about a given data analysis were developed and to focus more on how the statistical output was generated. However, in reality, one would have to consider both systems. While scientists working in a given domain are able to build this second system, statisticians will generally not have formal training in these areas. As a result, statisticians will have to engage in some formal training, work with a collaborator in the area of study, or engage in data analysis with only a superficial understanding of the science. All three of these options present problems in a classroom setting, especially where time may be limited. The approach taken here acknowledges the importance of the scientific process of developing expectations, but does not consider it in detail. This allows students from a variety of backgrounds and interests to focus on the general data analytic skill of diagnosing problems that arise with the application of statistical tools.

Many have written previously about the iterative nature of data analysis, cycling between the application of statistical methods to data, the assessment of the output, and the subsequent application of other methods (Wild and Pfannkuch 1999; Grolemund and Wickham 2014; Greenhouse and Seltman 2018). We believe the process of diagnosing unexpected outcomes is a key part of this iteration. Exploratory data analysis techniques in particular are designed to maximize the possibility of finding features of the data that are unexpected (Tukey 1977; Chambers et al. 1983). However, one cannot close the iterative loop if the root causes of such unexpected results cannot be identified and investigated. Therefore, we feel there are benefits to explicitly focusing on the diagnostic aspect of data analysis and reinforcing the practice with students.

During the implementation of the exercise, in both the group discussion and in evaluating the write-ups, we did not emphasize obtaining a correct answer, even if one did exist, and did not penalize students if they did not understand a specific R function or statistical tool. The reason is simply that even experts have misunderstandings about various tools, and the complexity and changing nature of the R language can make it difficult to have a complete understanding of everything at all times. Therefore, our focus was on observing if students could identify a potential cause and then specify a follow-up action in order to investigate. Presumably, if the root cause was incorrectly diagnosed, the follow-up action would eventually reveal that.

Although the students were not asked to execute any of their proposed follow-up actions, some did engage in additional exploration. For example, in Case Study 1 (Section 3.1), one of the students hypothesized that the mean() function caused the NaNs to appear in the output, but was not sure. That student then designed some test cases to better understand the behavior of the mean() function and determine when it does or does not produce NaN values. Hence, even though the original hypothesized cause was incorrect, the student quickly realized that upon examining the output of the test cases. For the purposes of this exercise, we felt that any root cause identified coupled with an appropriate follow-up action was a positive outcome.

While problems with data analyses can sometimes be traced back to software implementation errors, we have explicitly ignored such errors for the moment. Rather, we wanted the students to focus on the behavior of the statistical tools that they have applied and their understanding of how they should operate on data. For example, in Case Study 3, some students noted that collinearity between predictors could cause a confidence interval to be much larger than expected. Such an unexpected data analysis outcome is not due to an implementation error, but rather is the expected behavior of any linear regression software. Debugging software is an important skill for data science and it is possible that the approach we have devised here could be adapted to focus more on software implementation issues.

The concept of diagnosing unexpected data analytic outcomes is closely related to the software engineering concepts of debugging and testing (Donoghue, Voytek, and Ellis 2021). For example, Li et al. (2019) found that successful debugging of software required knowledge that cut across a variety of areas (i.e., domain knowledge, system knowledge, and functional knowledge) and that experienced developers were able to identify problems faster than novices. A study employing a variety of knowledge elicitation methods found that while individual developers were not able to identify all of the problems in a debugging task, combining the results of 4–5 developers resulted in twice as many problems being identified (Chao and Salvendy 1994). We observed a parallel phenomenon in this study where students from different backgrounds appeared to identify different causes of the unexpected data analytic results in the case studies. The teaching of software debugging has many parallels to the task of diagnosing data analysis anomalies and we believe that borrowing ideas from this literature could be a fruitful direction of future work. More generally, research on troubleshooting techniques may provide additional approaches that could be borrowed or adapted (Jonassen and Hung 2006).

Our approach here was carefully constructed to follow basic principles of instructional design and statistical pedagogy. Following Bransford, Brown, and Cocking (Bransford et al. 2000) as well as Cobb and McClain (Cobb and McClain 2004), our approach was knowledge centered in that the case studies were directly designed to consider and evaluate students’ diagnostic capabilities with respect to common data analytic problems. This approach is also critically learner centered because it leverages the diverse experiences and perspectives of the students. Because the students’ backgrounds were varied (particularly when considering those students from outside the Biostatistics Department), each student was able to contribute their unique experiences working with data and programming in R. The breakout group discussion portion of the assignment allowed for an assessment centered activity because students were able to directly explain their reasoning for the choices they made in both the diagnosis and in the follow-up actions. Furthermore, students were exposed to the reasoning of the other students in their breakout group, which generally led to being exposed to numerous new ideas. Finally, we felt that the use of small breakout groups encouraged the participation of students who might otherwise be intimidated by making presentations to the entire class.

4.1 Lessons Learned

Overall, we felt that this small pilot study of the exercise was successful in that it provided students a useful opportunity to practice their ability to consider how statistical methods and software might produce unexpected outcomes. All of the students produced reasonable root causes for the case studies and were able to describe follow-up actions in order to investigate those causes. There was some variation in the specificity of the causes and the follow-up actions, but all students met our baseline expectation of producing one root cause and one follow-up per case.

In the implementation of this exercise we benefited from the relatively small class size and advanced graduate student population in which it was administered. The students had fairly homogeneous training in statistical methods and we were very familiar with what they had learned in their previous courses. Therefore, we were able to tune the case studies to address what they should have already known. In other situations, where the backgrounds of the students may be more heterogeneous or even unknown, the development of some baseline knowledge may be needed before administering this kind of exercise.

One way in which we could have improved the implementation of the exercise would have been to give students some way to safely acknowledge their ignorance of a specific tool/method or to at least rate their confidence in their diagnosis. In each of the case studies, students had varying levels of understanding about the code and the methods involved and allowing students to rate their confidence in a diagnosis would have provided a much richer picture of their understanding than requiring them to definitively declare specific root causes. In addition, allowing for such a rating system could open up the possibility of encouraging the student to research or experiment with specific tools or software, in perhaps an extended version of this exercise.

A key limitation of the current implementation of this exercise is the qualitative nature of the assessment of both the written answers and the group discussion. The primary aim of this small pilot study was to give students the opportunity to discuss possible causes of the data analytic anomalies and give them verbal feedback on their assessments. To be more broadly applicable, a more formal rubric for evaluation is needed. One approach is for instructors to develop a complete set of plausible root causes for each unexpected outcome given to the students, perhaps ranked by their likelihood of occurrence. Evaluation could be accomplished by determining what portion of the complete set was covered by the submitted answer and whether the student captured the most likely causes. Chao and Salvendy (1994) took an approach similar to this in their study. Jonassen and Hung (2006) describe a simulator where learners are given a series of hypotheses that could explain an anomaly and learners have to choose which hypothesis to test and get feedback from the system. While implementing such a simulator or intelligent tutoring system would require substantial effort, it might work well in a large-scale setting such as a massive open online course (Kross et al. 2020). Another advantage of a simulator-type approach is that it allows students to iterate through a series of hypotheses to eventually arrive at a primary root cause based on successive feedback. This process allows for heterogeneity in the students’ approaches rather than requiring them to identify the root cause in a single iteration.

We did not employ a specific mechanism to encourage students to engage with the reading beyond informing them that they would be asked to discuss their work in the small group discussions a week later. In retrospect, it might have been productive to employ such a mechanism. However, this was a small graduate course of PhD and master’s students who were highly engaged in the course and our experience teaching this course for many years suggested that a mechanism for encouraging the students to do the reading was not necessary. That said, a reading quiz (or something similar) might have also served to give the students some immediate feedback on their understanding of the material. Furthermore, in other settings, such as with larger classes or with students with more heterogeneous backgrounds, the approach that we took would likely not be appropriate or effective.

The implementation described here focuses on the design and presentation of the case studies, but still leaves some open questions regarding the implementation of the exercise in the classroom. Our implementation consisted of a written element and a group discussion element during the class period.

5 Summary and Future Work

Diagnosing possible causes of unexpected outcomes is a key element of the iterative nature of data analysis and is a skill that we have found to be important to being a good data scientist. In this article, we have developed an exercise targeted at practicing and assessing this skill. The case studies are relatively easy to develop in part because they do not require instructors to obtain any datasets nor require students to have significant background scientific knowledge. We feel our approach could be appropriate as a module in a larger semester-long course on data science or statistical methods.

We found the exercise to be a useful assessment of the depth of students’ knowledge about the behavior of the statistical tools and software, but the lack of a formal or quantitative assessment of student performance limited our ability to generalize our findings. As such, the results here primarily represent a proposal for how statistical diagnosis skills could be specifically targeted for practice in a homework assignment or lesson. While we encourage instructors to explore the materials we have developed here, we caution that further work is still needed to evaluate their effectiveness. Future work could consist of developing formal evaluation methods and a more quantitative approach to assessing student understanding. For example, written rubrics could allow instructors to specifically characterize which aspects of a diagnosis were accomplished and which were not. Finally, more specific guidance regarding the structure of the group discussions could be developed in order to facilitate implementation, especially in larger class settings.


Supplementary Material

The complete lecture on diagnosing data analytic problems that was presented in the course is provided in the supplementary material.

Disclosure Statement

The authors declare no competing interests.

References

  • ASA Undergraduate Guidelines Workgroup. (2014), Curriculum Guidelines for Undergraduate Programs in Statistical Science, American Statistical Association. Available at http://www.amstat.org/education/curriculumguidelines.cfm.
  • Bransford, J. D., Brown, A. L., Cocking, R. R. (2000), How People Learn, Washington, DC: National Academy Press. Available at http://nap.edu/9853.
  • Carver, R., Everson, M., Gabrosek, J., Horton, N., Lock, R., Mocko, M., Rossman, A., Roswell, G. H., Velleman, P., Witmer, J., and Wood, B. (2016), “Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report 2016.” Available at http://www.amstat.org/education/gaise.
  • Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983), Graphical Methods for Data Analysis, Belmont, CA: Wadsworth.
  • Chao, C.-J., and Salvendy, G. (1994), “Percentage of Procedural Knowledge Acquired as a Function of the Number of Experts From Whom Knowledge is Acquired for Diagnosis, Debugging, and Interpretation Tasks,” International Journal of Human–Computer Interaction, 6, 221–233.
  • Clark, R. E., Feldon, D., van Merrienboer, J., Yates, K., and Early, S. (2007), “Cognitive Task Analysis,” in Handbook of Research on Educational Communications and Technology (Chap. 43, 3rd ed.), eds. Spector, J. M., Merrill, M. D., van Merrienboer, J., and Driscoll, M. P., New York, NY: Lawrence Erlbaum Associates.
  • Cobb, P. and McClain, K. (2004), “Principles of Instructional Design for Supporting the Development of Students’ Statistical Reasoning,” in The Challenge of Developing Statistical Literacy, Reasoning and Thinking, eds. D. Ben-Zvi and J. Garfield, New York, NY: Springer, pp. 375–395.
  • Donoghue, T., Voytek, B., and Ellis, S. E. (2021), “Teaching Creative and Practical Data Science at Scale,” Journal of Statistics and Data Science Education, 29, S27–S39.
  • Greenhouse, J. B., and Seltman, H. J. (2018), “On Teaching Statistical Practice: From Novice to Expert,” The American Statistician, 72, 147–154.
  • Grolemund, G. and Wickham, H. (2014), “A Cognitive Interpretation of Data Analysis,” International Statistical Review, 82, 184–204.
  • Hardin, J., Hoerl, R., Horton, N. J., and Nolan, D. (2015), “Data Science in Statistics Curricula: Preparing Students to ‘Think with Data’,” The American Statistician, 69, 343–353.
  • Jonassen, D. H. and Hung, W. (2006), “Learning to Troubleshoot: A New Theory-Based Design Architecture,” Educational Psychology Review, 18, 77–114.
  • Kross, S., Peng, R. D., Caffo, B. S., Gooding, I., and Leek, J. T. (2020), “The Democratization of Data Science Education,” The American Statistician, 74, 1–7.
  • Li, C., Chan, E., Denny, P., Luxton-Reilly, A., and Tempero, E. (2019), “Towards a Framework for Teaching Debugging,” in Proceedings of the Twenty-First Australasian Computing Education Conference, pp. 79–86.
  • Lovett, M. (2001), “A Collaborative Convergence on Studying Reasoning Processes: A Case Study in Statistics,” in Cognition and Instruction: 25 Years of Progress, eds. S. M. Carver and D. Klahr, New York: Lawrence Erlbaum, pp. 347–384.
  • Lovett, M. C., and Greenhouse, J. B. (2000), “Applying Cognitive Theory to Statistics Instruction,” The American Statistician, 54, 196–206.
  • Nolan, D., and Lang, D. T. (2010), “Computing in the Statistics Curricula,” The American Statistician, 64, 97–107.
  • Nolan, D., and Temple Lang, D. (2015), “Explorations in Statistics Research: An Approach to Expose Undergraduates to Authentic Data Analysis,” The American Statistician, 69, 292–299.
  • Ríos, L., Pollard, B., Dounas-Frazer, D. R., and Lewandowski, H. (2019), “Using Think-Aloud Interviews to Characterize Model-Based Reasoning in Electronics for a Laboratory Course Assessment,” Physical Review Physics Education Research, 15, 010140.
  • StataCorp (2021), Stata Statistical Software: Release 17, College Station, TX: StataCorp LLC.
  • Tukey, J. W. (1977), Exploratory Data Analysis, Reading, MA: Pearson.
  • Wild, C. J., and Pfannkuch, M. (1999), “Statistical Thinking in Empirical Enquiry,” International Statistical Review, 67, 223–248.

Appendix A

Lecture on Diagnosing Data Analytic Problems

The didactic material we developed is presented here: https://rdpeng.org/ads2020/week-13.html.

Appendix B

Case 1: Sample Mean

We are building a data analysis system to compute average levels of an ambient air pollutant in Baltimore. Our collaborators have designed a data collection system that measures ambient pollution values once per hour at an outdoor monitor and writes them to a CSV file. That CSV file is then sent to us to compute the daily mean. Because the nature of the process that generates air pollution is known to produce highly skewed data, we have decided to log-transform the data first.

B.1. System

The data analytic system can be described in the following steps:

  1. Read in the data on variable X to obtain x1, …, xn.

  2. Log transform the data by creating zi = log(xi).

  3. Compute the sample mean z¯ = (z1 + ⋯ + zn)/n

  4. Output z¯ to the console

If we assume that the dataset provided is in a file dataset.csv with one column labeled x indicating our variable of interest, then some pseudo-R code for implementing the system would be:

dat <- read.csv("dataset.csv")

z <- log(dat$x)

zbar <- mean(z)

print(zbar)

Assume for this example that the dataset size is always n=24 (one measurement for each hour of the day) and never changes.

B.2. Expected Outcome

Our expectation is that the output will be a single numerical value falling in the interval (−∞, ∞).

B.3. Observed Outcome

The value of z¯ is shown to be NaN when printed to the console.

B.4. Diagnosis

Write up your diagnosis of the unexpected outcome here and summarize all of the possible root causes. Please be as detailed as possible when explaining how a given event might cause the unexpected outcome to occur.

B.5. Follow Up

  1. Describe a single plot or summary you might do next in order to provide further information on the nature of the observed unexpected outcome.

  2. Describe what you would do next (if anything) or what you might recommend be modified about the system in response to your diagnosis.

Appendix C

Case 2: Linear Regression Model with Missing Data

We are interested in fitting a linear regression model with y as the outcome and x as the predictor. We know from experience that the process that generates the data can occasionally produce missing values in either the y or x variables and these are indicated by “NA” values in the dataset. Our primary interest is in β̂x, the estimated slope coefficient for x.

The datasets that our system will be applied to will not always be of the same size, so the sample size of the raw data will not be known in advance of applying the system. From preliminary data analysis, we estimate that the standard deviation of the errors in the model will be about 5.

C.1. System

Assume that the dataset provided is in a file “dataset.csv” with one column labeled “y” and another labeled “x” indicating our variables of interest. The data analytic system can be described in the following steps:

  1. Read the dataset in.

  2. Remove any rows that contain NA values in either x or y.

  3. Fit a linear model of y using x as a predictor.

  4. Extract β̂x, the slope coefficient for x.

  5. Print β̂x to the console.

C.2. Expected Outcome

Our expectation is that β̂x ∈ [0, 2].

C.3. Observed Outcome

β̂x is shown to be −3 when printed to the console.

C.4. Diagnosis

Write up your diagnosis of the unexpected outcome here and summarize all of the possible root causes. Please be as detailed as possible when explaining how a given event might cause the unexpected outcome to occur.

C.5. Follow Up

  1. Describe a single plot or summary you might do next in order to provide further information on the nature of the observed unexpected outcome.

  2. Describe what you would do next (if anything) or what you might recommend be modified about the system in response to your diagnosis.

C.6. Case 3: Poisson Regression Confidence Interval

We are interested in conducting a Poisson regression of a count outcome y on two predictors x and z. Our primary predictor is x while z could be thought of as an important confounding variable. Our primary interest is in the coefficient for the predictor x, which we call βx. As a result, we design a system that estimates βx using maximum likelihood and produces an approximate 95% confidence interval for βx.

C.7. System

The data analytic system can be described in the following steps:

  1. Read the dataset in.

  2. Fit a Poisson GLM with a log link to y using x and z as predictors

  3. Extract β̂x, the regression coefficient for x from the model fit.

  4. Extract se(β̂x), the standard error for β̂x

  5. Use a Normal approximation to compute an approximate 95% confidence interval for βx, using β̂x±1.96×se(β̂x).

  6. Output β̂x and the confidence interval

If we assume that the dataset provided is in a file dataset.csv with one column labeled y, another labeled x, and another labeled z indicating our variables of interest, then some pseudo-R code for implementing the system would be:

dat <- read.csv("dataset.csv")

fit <- glm(y ~ x + z, data = dat, family = poisson)

beta_x <- coef(fit)["x"]

se_betax <- sqrt(diag(vcov(fit)))["x"]

conf <- beta_x + 1.96 * c(-se_betax, se_betax)

print(c(beta_x, conf))

C.8. Expected Outcome

Our expectations are that

  1. β̂x ∈ [0, 0.4];

  2. The lower limit of the confidence interval should not be less than −0.2;

  3. The upper limit of the confidence interval should not be more than 0.6.

C.9. Observed Outcome

We observe that β̂x=0.013 and that the confidence interval is [−5.6, 6.8] when outputted to the console.

C.10. Diagnosis

Write up your diagnosis of the unexpected outcome here and summarize all of the possible root causes. Please be as detailed as possible when explaining how a given event might cause the unexpected outcome to occur.

C.11. Follow Up

  1. Describe a single plot or summary you might do next in order to provide further information on the nature of the observed unexpected outcome.

  2. Describe what you would do next (if anything) or what you might recommend be modified about the system in response to your diagnosis.