ABSTRACT
Algorithmic decision-making (ADM) promises to strengthen evidence-based decisions, particularly for better managing risks in various domains. Its use also extends to the criminal justice system, where algorithmic risk assessments can provide valuable evidence to inform highly sensitive decisions. Yet such algorithmic tools also introduce intricate problems tied to the fundamental question of exactly what kind, and what quality, of evidence they offer. This paper illustrates this problem through a comparison of pretrial risk assessments that have been implemented statewide in the USA. The authors highlight the empirical variation in the construction, evaluation and documentation of these tools to carve out the considerable discretion involved along these dimensions. They also point to further possible ways of assessing the performance of these tools and show why evaluating the quality of the evidence delivered by algorithmic risk assessments is far from straightforward.
Acknowledgements
We thank the anonymous reviewers for their very helpful comments and suggestions. Thanks also go to Malin Grüninger and Louisa Prien for their assistance in researching the information used in this article.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
3 And it could lead to differential treatment of groups of people. When certain groups are disproportionately arrested despite having the same crime rate as other groups, criminal history as a predictor can reinforce existing discrimination: ‘[r]acial bias in arrests leads to racial bias in risk scores’ (Eckhouse et al., 2019, p. 196).
4 On a more technical note, a trained statistical model underlying an algorithmic risk assessment tool may still be useful overall for predicting outcomes even if individual predictors are not statistically significant. If many of the included predictors are correlated with each other, they share variation and can thus partly substitute for each other. This also means that the unique variation and information each predictor contributes to explaining the outcome is reduced. This multicollinearity of predictors makes it less likely that their coefficients are statistically significant at conventional levels. If the primary aim is to build a well-performing prediction model, one could give less weight to the Type I error, largely ignore the statistical significance of predictors, and instead aim to avoid the Type II error: accepting the null hypothesis of ‘no effect’ even though there is an effect. However, such models will be inefficient and substantively misleading, because some predictors could be dropped without losing predictive power and because one does not know which predictors count as relevant explanatory factors. Consequently, when accepting a high Type I error, one might keep predictors in the model that have no predictive power of their own and do not contribute to a high risk of failure (the outcome). Yet some individuals may well score on these statistically irrelevant features in ways that earn them higher risk scores.
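The substitutability of correlated predictors can be sketched in a few lines of code. The data below are purely illustrative (simulated, not drawn from any tool discussed here): two nearly collinear predictors achieve almost the same predictive fit when used individually, because they share most of their variation.

```python
# Illustrative sketch: two highly correlated predictors largely
# substitute for each other in prediction (simulated data).
import random

random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.1) for a in x1]   # x2 is nearly collinear with x1
y  = [a + random.gauss(0, 1) for a in x1]     # outcome driven by x1 only

def r2_single(x, y):
    """R2 of a one-predictor least-squares fit (closed form)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy ** 2) / (sxx * syy)

# Either predictor alone yields nearly the same fit.
print(round(r2_single(x1, y), 3), round(r2_single(x2, y), 3))
```

Because x1 and x2 carry almost identical information, neither contributes much unique variation once the other is in the model, which is exactly the situation in which coefficients fail to reach conventional significance levels.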
5 The PTRA, however, does build on the analyses by VanNostrand and Keebler (2009), who present regression tables for determinants of pretrial outcomes.
6 Moreover, the AUC-ROC is calculated over the full range of thresholds above which a predicted score counts as belonging to the positive outcome class. However, not all thresholds are equally plausible or adequate, meaning that only part of the AUC-ROC region is relevant. As noted earlier, choosing a certain threshold directly amounts to giving a false positive greater (or equal) weight than a false negative, or vice versa. Which thresholds are plausible therefore depends on this weight ratio; ideally, the AUC-ROC is interpreted against a clear definition of the relative weights of the two classification errors.
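To make the threshold-dependence concrete, here is a minimal sketch with hypothetical scores and labels (assumed for illustration, not taken from any of the tools discussed): each threshold fixes one trade-off between false positives and true positives, i.e., one point on the ROC curve.

```python
# Hypothetical risk scores and observed outcomes (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]

def roc_point(threshold):
    """(FPR, TPR) when scores at or above the threshold count as positive."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    p = sum(labels)
    n = len(labels) - p
    return fp / n, tp / p

for t in (0.25, 0.5, 0.75):
    print(t, roc_point(t))   # each threshold -> one ROC point
```

Here the threshold 0.25 yields (FPR, TPR) = (0.5, 1.0), 0.5 yields (0.25, 0.75), and 0.75 yields (0.0, 0.5); the AUC-ROC averages over all such points, including thresholds whose implied error weighting may be implausible for the decision at hand.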
7 If one were to get a classifier performance below 0.5, e.g., 0.3, one could simply swap the positive and negative classes and obtain the complement, i.e., 0.7.
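This label-flipping identity can be verified directly. The sketch below uses the standard rank-based formulation of the AUC; the scores and labels are hypothetical:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

scores  = [1, 2, 3, 4]
labels  = [1, 0, 1, 0]            # a classifier that mostly ranks backwards
flipped = [1 - l for l in labels] # swap positive and negative class

print(auc(scores, labels), auc(scores, flipped))  # 0.25 and 0.75
```

Swapping the classes turns an AUC of a into 1 - a, so any score below 0.5 maps onto its complement above 0.5.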
8 It should be noted that these R2-values reflect the fit of models containing features that were pre-selected for relevance on the basis of bivariate analyses. This means the models do not contain totally irrelevant predictors, for which the model fit would be penalized by the Pseudo-R2, as these values are deflated depending on the total number of included predictors. In other words, without first winnowing the field of predictors, the overall model fit would have been lower.
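The deflation mechanism can be illustrated with one common variant, McFadden's adjusted pseudo-R2, which subtracts the number of predictors K from the model's log-likelihood; all log-likelihood values below are hypothetical, not the fitted models from the paper.

```python
# McFadden's (adjusted) pseudo-R2 with hypothetical log-likelihoods:
# adding predictors raises K and deflates the adjusted value.
def mcfadden(ll_model, ll_null, k=0):
    """Adjusted McFadden pseudo-R2: 1 - (LL_model - K) / LL_null."""
    return 1 - (ll_model - k) / ll_null

ll_null  = -500.0   # intercept-only model
ll_model = -420.0   # model with predictors

print(mcfadden(ll_model, ll_null))         # unadjusted: 0.16
print(mcfadden(ll_model, ll_null, k=10))   # with K = 10 predictors: 0.14
```

Carrying ten irrelevant predictors thus costs the model fit directly, which is why pre-selecting features before fitting tends to yield higher reported values.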