Evaluating the evidence in algorithmic evidence-based decision-making: the case of US pretrial risk assessment tools

Pages 359-381 | Published online: 17 Jan 2021
 

ABSTRACT

Algorithmic decision-making (ADM) promises to strengthen evidence-based decisions, particularly to better manage risks in various domains. Its use also extends to the criminal justice system where algorithmic risk assessments potentially provide very valuable evidence that can inform highly sensitive decisions. Yet, such algorithmic tools also introduce intricate problems that are tied to the fundamental question of exactly what kind and what quality of evidence they offer. This paper illustrates this problem based on a comparison of pretrial risk assessments that have been implemented statewide in the USA. The authors highlight the empirical variation in the construction, evaluation and documentation of these tools to carve out the considerable discretion involved along these dimensions. They also point to further possible ways of looking at the performance of these tools and show why evaluating the quality of the evidence delivered by algorithmic risk assessments is a far from straightforward affair.

Acknowledgements

We thank the anonymous reviewers for their very helpful comments and suggestions. Thanks also go to Malin Grüninger and Louisa Prien for their assistance in researching the information used in this article.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

3 And it could lead to differential treatment of groups of people. Because certain groups are disproportionately arrested despite having the same crime rate as other groups, criminal history can reinforce existing discrimination: ‘[r]acial bias in arrests leads to racial bias in risk scores’ (Eckhouse et al., 2019, p. 196).

4 On a more technical note, the trained statistical model underlying an algorithmic risk assessment tool may still be useful overall for predicting outcomes even if individual predictors are not statistically significant. If many of the included predictors are correlated with each other, they share variation and can thus partly be substituted for each other. This also means that the unique variation and information a predictor contributes to explaining an outcome is decreased. This multicollinearity of predictors makes it less likely that their coefficients are significant at conventional levels. If the primary aim is to build a well-performing prediction model, one could give less weight to the Type-I error, largely ignore the statistical significance of predictors and instead aim to avoid the Type-II error: accepting the null hypothesis of ‘no effect’ even though there is an effect. However, the resulting models will be inefficient and substantively misleading, because some predictors could be dropped without losing predictive power and because one does not know which predictor counts as a relevant explanatory factor. Consequently, when accepting a high Type-I error, one might keep predictors in the model that themselves have no predictive power and do not contribute to a high risk of failure (the outcome). Yet some individuals may well have scores on these statistically irrelevant features that earn them higher risk scores.
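For illustration, the following is a minimal sketch with simulated data (not drawn from any of the tools discussed in the article; the variable names and the use of statsmodels and scikit-learn are our own assumptions). It shows how two nearly collinear predictors can jointly yield a usable prediction model even though neither coefficient is individually significant at conventional levels.

import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 300

# x1 and x2 are nearly collinear; the outcome depends only on their shared signal.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)            # correlation close to 1
y = rng.binomial(1, 1 / (1 + np.exp(-(x1 + x2))))  # simulated binary outcome

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.Logit(y, X).fit(disp=False)

print(fit.summary())                               # inflated standard errors; p-values often above 0.05
print("AUC:", roc_auc_score(y, fit.predict(X)))    # yet the model separates outcomes well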

5 The PTRA, however, does build on the analyses by VanNostrand and Keebler (2009), who present regression tables for determinants of pretrial outcomes.

6 Moreover, the AUC-ROC is calculated across the range of thresholds above which a predicted score counts as belonging to the positive outcome class. However, not all thresholds are equally plausible or adequate, meaning that only part of the AUC-ROC region is relevant. As noted earlier, choosing a certain threshold directly implies giving a false positive a greater weight than (or the same weight as) a false negative, or vice versa. Which thresholds are plausible therefore depends on this weight ratio and, ideally, the AUC-ROC is interpreted against a clear definition of the relative weights of the classification errors.
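As a rough sketch of what such a restriction could look like in practice (the simulated data, the false-positive-rate cap of 0.2 and the use of scikit-learn are purely illustrative assumptions, not the procedure of any tool discussed here), one can compute the area only over the part of the ROC curve whose thresholds are deemed plausible.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.3, size=500)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=500), 0, 1)

full_auc = roc_auc_score(y_true, y_score)

# Suppose only thresholds with a false-positive rate of at most 0.2 are plausible,
# reflecting a chosen weighting of false positives against false negatives.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
mask = fpr <= 0.2
partial_auc = auc(fpr[mask], tpr[mask])            # area over the plausible region only

print(f"full AUC-ROC: {full_auc:.3f}")
print(f"partial AUC (FPR <= 0.2): {partial_auc:.3f}")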

7 If one were to get a classifier performance below 0.5, e.g., an AUC-ROC of 0.3, one could simply swap the positive and negative class and obtain the inverse, i.e., 0.7.
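A minimal illustration of this point with simulated data (the variable names and the scikit-learn usage are our own):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.binomial(1, 0.5, size=200)
y_score = rng.uniform(size=200) - 0.2 * y_true     # scores deliberately worse than chance

auc_worse = roc_auc_score(y_true, y_score)
auc_flipped = roc_auc_score(y_true, -y_score)      # equivalent, for the AUC, to swapping the classes
print(f"{auc_worse:.3f} + {auc_flipped:.3f} = {auc_worse + auc_flipped:.1f}")  # the two values sum to 1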

8 It should be noted that these R²-values reflect the fit of models containing features that were pre-selected for their relevance on the basis of bivariate analyses. The models therefore do not contain totally irrelevant predictors, for which the model fit would be penalized by the pseudo-R², since such measures are deflated according to the number of included predictors. Without first winnowing the field of predictors, the overall model fit would thus have been lower.
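For illustration only (the note does not specify which adjusted variant is meant), one common pseudo-R² that penalizes the number of predictors is McFadden's adjusted pseudo-R², where $\hat{L}$ denotes the likelihood of the fitted and of the intercept-only model and $K$ the number of estimated parameters:

$$ \tilde{R}^2_{\mathrm{McFadden}} = 1 - \frac{\ln \hat{L}(M_{\mathrm{full}}) - K}{\ln \hat{L}(M_{\mathrm{null}})} $$

Adding irrelevant predictors increases $K$ without materially improving $\ln \hat{L}(M_{\mathrm{full}})$, so the adjusted value falls.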

Additional information

Funding

The authors disclose receipt of the following financial support for the research, authorship, and/or publication of this article: This research was conducted within the project “Deciding about, by, and together with algorithmic decision-making systems”, funded by the Volkswagen Foundation.
