Research Articles

Building Better Theories: Prediction Intervals as a Tool for Theory Testing and Improvement

Pages 146-168 | Published online: 16 Feb 2024
 

Abstract

The precision with which a theory predicts behavior speaks both to the quality of that theory and to its potential utility in real-world applications. Unfortunately, predictive precision is frequently overlooked when evaluating theories in social psychology. Here we call attention to the benefits of placing more emphasis on predictive precision. We define what we mean by predictive precision, consider its role in theory evaluation from a philosophy of science perspective, and suggest a method for quantifying it: by constructing prediction intervals. A prediction interval is the range of values within which an empirical observation (e.g., a sample mean) is expected to fall some specified percentage (e.g., 95%) of the time. The size of a prediction interval reflects the degree of fluctuation in the empirical observations expected by theory, and so the precision of its prediction. Prediction intervals are useful because they simplify theory testing and provide a metric by which theory improvement—and so scientific progress—can be gauged. Prediction intervals are easily created when theories are expressed formally as agent-based models. We illustrate the process of creating and using prediction intervals in three detailed examples, each involving a different agent-based model concerned with the behavior of small interacting groups.

Acknowledgements

We would like to thank Jeff Huntsinger, Stellan Ohlsson, Gary Stasser, and Young-Jae Yoon for their helpful comments on earlier drafts of this paper.

Disclosure statement

The authors declare that there is no conflict of interest with respect to the publication of this article.

Box 1. Steps for creating and using prediction intervals.

  1. Set the ABM’s parameters to values that best fit the target human study.

  2. Run the ABM n times, where n is the number of observations in the relevant condition(s) of the target human study. Collectively, these n runs constitute a single simulation of that target study.

  3. Treat each output variable from Step 2 in the same way the relevant DV from the target human study is treated (e.g., compute the mean across the n runs, compute the proportion of the n runs in which a particular value occurred, etc.). These constitute the ABM’s predictions for that simulation.

  4. Repeat Steps 2 and 3 a large number of times (e.g., 10,000), storing the predictions from each repetition.

  5. Sort the predictions generated in Step 4 in numerical order, separately for each output variable.

  6. For each sorted output variable, eliminate the top and bottom 2.5 percent of the predictions. The limits of what remains constitute the 95% prediction interval (inclusive) for that variable.

  7. For each variable, compare the result from the target human study to its 95% prediction interval. If the target study result lies within that interval, it should be judged consistent with what the ABM typically predicts, and a “fail-to-reject” conclusion is appropriate regarding the theory expressed by that ABM. If, on the other hand, the target study result falls outside its 95% prediction interval, it should be judged inconsistent with what the ABM typically predicts, and a “reject” conclusion is appropriate. No further statistical analysis is required to draw either conclusion. (A computational sketch of Steps 2–7 follows this box.)
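The following is a minimal computational sketch of Box 1, Steps 2–7, written in Python with NumPy. It is illustrative only: the function run_abm_once() is a hypothetical stand-in for a single run of a NetLogo ABM, and the placeholder output distribution, observation count, and observed sample mean are assumptions for the sake of the example, not values from the paper.

import numpy as np

rng = np.random.default_rng(1)

def run_abm_once():
    # Hypothetical stand-in for one ABM run (Step 2); returns one output
    # variable, e.g., the number of facts a simulated group mentions.
    return rng.poisson(lam=8.0)  # placeholder distribution, not a real model

def simulate_target_study(n_obs):
    # Steps 2-3: n_obs runs constitute one simulated study; summarize the
    # output the same way the target human study's DV is summarized.
    return np.mean([run_abm_once() for _ in range(n_obs)])

# Step 4: repeat the simulated study many times, storing each prediction.
n_obs = 48  # number of observations in the (hypothetical) target study
predictions = [simulate_target_study(n_obs) for _ in range(10_000)]

# Steps 5-6: sort and trim the top and bottom 2.5% to obtain the 95% interval.
lower, upper = np.percentile(predictions, [2.5, 97.5])

# Step 7: compare the target human study's result to the interval.
observed = 9.1  # illustrative sample mean, not a real result
verdict = "fail to reject" if lower <= observed <= upper else "reject"
print(f"95% prediction interval: [{lower:.2f}, {upper:.2f}] -> {verdict}")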

Notes

1 The term “prediction interval” is also used in the statistical forecasting literature, but refers there to the range of likely future observations as determined by statistical analysis of past observations (e.g., Geisser, Citation1993; Hyndman & Athanasopoulos, Citation2018; Patil et al., Citation2016). Our usage, by contrast, focuses on the range of likely future observations predicted by theory. Also, Vanpaemel (Citation2020) introduced a set of ideas roughly parallel to those presented here, but that apply to models expressed in strictly mathematical terms. By contrast, we focus on their application in agent-based modeling, which for many is a more accessible approach to theoretical formalism.

2 ABMs can be used for purposes other than prediction vis-a-vis specific empirical targets (e.g., Edmonds et al., Citation2019; Epstein, Citation2008). However, these are beyond the scope of the current paper and are not considered here.

3 Two of the original models were written in general purpose programming languages that are no longer supported, and none were written in a way that easily accommodates the construction of prediction intervals. Thus, it was primarily for the sake of expedience that each model was recreated in NetLogo expressly for this project.

4 These materials can all be downloaded from https://osf.io/537v2/. Each ABM also requires the NetLogo modeling environment, which is freely available at https://ccl.northwestern.edu/netlogo.

5 DISM-GD is concerned only with the initial entry of information into discussion. Hence, for the sake of simplicity, repetitions of already-mentioned facts are not allowed.

6 This is true as well for any agent that, in the prior round, either failed to recall a fact or happened by chance to retrieve the same just-spoken fact.

7 The prediction intervals expand in the final few discussion sequence positions because the predictions there are based on progressively fewer group discussions. This mimics what actually happened in the Larson et al. (Citation1998) study: 7 or more facts were mentioned in all 48 group discussions, 12 or more were mentioned in 30 discussions, and 14 or more were mentioned in just 15 discussions. The gradual loss of groups later in the discussion sequence is simulated in DISM-GD by the probabilistic way in which discussion is terminated (see Appendix). Predictions based on fewer simulated group discussions per study generally have wider prediction intervals, implying less predictive precision.
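To make this point concrete, here is a small sketch in Python that uses a placeholder output distribution (not DISM-GD itself) to show that predictions based on fewer simulated group discussions yield wider 95% prediction intervals. The group counts simply echo those noted above.

import numpy as np

rng = np.random.default_rng(1)

def interval_width(n_groups, reps=10_000):
    # Each prediction is a mean over n_groups simulated discussions
    # (placeholder Poisson outputs standing in for a real ABM's output).
    preds = rng.poisson(lam=8.0, size=(reps, n_groups)).mean(axis=1)
    lo, hi = np.percentile(preds, [2.5, 97.5])
    return hi - lo

for n in (48, 30, 15):  # group counts echoing the Larson et al. (1998) study
    print(f"n = {n:2d} simulated discussions -> interval width {interval_width(n):.2f}")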

8 It is possible, on the other hand, for the confidence interval surrounding a sample statistic to overlap an ABM’s prediction interval even though the sample statistic itself falls outside that interval. The temptation in this case might be to draw the same fail-to-reject conclusion as would be drawn had the sample statistic fallen inside the prediction interval. We recommend against this practice, however, as it undermines the logic of prediction intervals. Due to the probabilistic nature of ABMs, sampling variability is already built into their prediction intervals. Thus, the more conservative—and appropriate—approach is to ignore confidence intervals altogether, and focus instead simply on whether or not the sample statistic itself falls within the relevant prediction interval.

9 They also increase the probability of retrieving, in a subsequent round, a positive semantic associate of the spoken item, and thus the likelihood of cross-cuing.

10 The predictions for individuals are of little interest beyond confirming our choice of parameter settings.

11 Unlike Examples 1 and 2, here there are no specific facts to recall and speak. Rather, it is the act of speaking itself, not what is spoken, that is of interest.

12 These three datasets, along with the NetLogo program used to compute the group state and state transition relative frequencies observed in them, are available on the OSF website noted earlier.

13 It is beyond the scope of this paper to describe such an analysis in any detail, but see Wagenmakers, Marsman, et al. (Citation2018) for a useful conceptual guide, and Wagenmakers, Love, et al. (2018) for an overview of relevant software.

14 Although each of our three examples involves multiple variables and/or treatment conditions, none of the reject/fail-to-reject conclusions we drew there depended in any way on comparisons between those variables or conditions. This does not mean that including multiple variables and treatment conditions is uninformative. To the contrary, it can be quite informative, in the sense that doing so provides additional opportunities to test an ABM via the prediction interval approach. But each such test involves an independent assessment of the sample’s location on the relevant scale of measurement vis-à-vis the ABM’s prediction interval, not vis-à-vis another variable or treatment condition.

15 Random retrieval can be viewed as a “baseline” model against which more psychologically principled approaches (e.g., preference-consistent retrieval) might be compared (e.g., Stasser, Citation1988).

16 In our experience, this “prompting” is most productive when an ABM is first being developed, for that is when alternative implementations, each with their own unique auxiliary assumptions, are most likely to be considered. Still, the specificity of the code can help anyone, even after the fact, infer the auxiliary assumptions it might imply.

Additional information

Funding

A portion of this work was supported by National Library of Medicine grant RO1-LM05481 and National Science Foundation grant SBR-9809207 to the first author.
