Letter to the Editor

Comments on: A Re-analysis of Repeatability and Reproducibility in the Ames-USDOE-FBI Study, by Dorfman and Valliant

Article: 2188069 | Received 13 Dec 2022, Accepted 02 Mar 2023, Published online: 19 Apr 2023

Dorfman and Valliant (2022), hereafter DV, take issue with conclusions from Bajic et al. (2020), the Ames Laboratory technical report, hereafter AL, containing the results of an FBI-sponsored firearms examiner study. In their paper, DV discuss our analysis of examiner repeatability and reproducibility; they do not address the more fundamental issue of accuracy. Many of their comments lack an objective basis, and others misrepresent how the results of this study should be viewed.

DV present a brief overview of the study described in AL in Sections 1 and 2, and their more specific discussion begins with Section 3.

1 Subjective Statements

Before addressing DV’s more technical concerns, it is important to point out the subjective nature of many of the statements made in the paper, all suggesting that the repeatability and reproducibility indicated by the data are poor or unacceptable. Examples from Sections 3 and 4 include:

  • “This level of disagreement seems high.” (p. 177)

  • “The amount of disagreement under any scenario seems substantial, and certainly gives us the impression of not very strong repeatability.” (p. 177)

  • “These are rather large percentages of disagreement.” (p. 178)

  • “The level of repeatability and reproducibility as measured by the between rounds consistency of conclusions would not appear to support the reliability of firearms examination.” (p. 178)

No objective rationale is offered for any of these statements; they are simply stated as the subjective impression of the authors. On p. 180, they state that "Further analysis leads to the conclusion … that repeatability (and a fortiori reproducibility) is not satisfactory." But this analysis depends upon unjustified rules of thumb as well as mistaken arguments, as I explain below.

2 Observed and Expected Agreement, and Cohen’s Kappa

The repeatability and reproducibility analyses in AL are based on calculated (estimated) pairs of observed and expected agreement proportions. AL contains several graphs showing, for each examiner (repeatability) or pair of examiners (reproducibility), the observed and expected agreement proportions, specific to the same-source (matching) and different-source (nonmatching) sets examined. DV also discuss expected and observed proportions, offering a tutorial on how they are calculated. For present purposes, I'll call these two proportions P_E and P_O, respectively, following the notation of their paper. Here, they introduce Cohen's kappa, one measure of rater agreement that has been popular for many years, especially in the social sciences. As presented in eq. (1) on p. 181, kappa combines the observed and expected agreement proportions into one number, as κ = (P_O - P_E)/(1 - P_E).
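
For readers who want to check the arithmetic, here is a minimal sketch of that calculation; the input proportions below are hypothetical values chosen only for illustration, not figures from the study.

```python
def cohens_kappa(p_o: float, p_e: float) -> float:
    """Cohen's kappa computed from observed (p_o) and expected (p_e) agreement proportions."""
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is exactly 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical illustration: observed agreement 0.90, expected agreement 0.66
print(round(cohens_kappa(0.90, 0.66), 3))  # 0.706
```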

There are reasons why this index has been very popular, not the least of which is the combining of two numbers into one. This transformation comes with an accompanying loss of information; kappa can be computed from P_O and P_E, but not the other way around. As a result, one must accept the idea that the specific functional form of kappa preserves the relevant information about agreement. Many of the authors' arguments, and their implication that the levels of repeatability and reproducibility apparent in the AL data are poor, are based on values of kappa.
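
To make the one-way nature of this reduction concrete, the sketch above can be reused with two hypothetical (P_O, P_E) pairs that differ substantially yet collapse to exactly the same kappa.

```python
# Two different (P_O, P_E) pairs, one and the same kappa
print(round(cohens_kappa(0.90, 0.80), 3))  # 0.5
print(round(cohens_kappa(0.95, 0.90), 3))  # 0.5 -- the pair cannot be recovered from kappa alone
```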

2.1 Acceptably Large Values of Kappa

Suppose you are willing to reduce each (P_O, P_E) pair to a single value of kappa; how large should you demand these be? DV admit that "This has been debated with various standards suggested over the years" (p. 181) and then offer a table from McHugh (2012), who "suggests" interpretations of kappa values in the context of comparisons affecting health care. As an example, a kappa of 0.40–0.59 is said to represent "weak" agreement. But as with the authors' statements regarding their impressions noted above, these assignments are, at best, based on what has been observed empirically in fundamentally different settings. In particular, there is no acknowledgement that levels of agreement depend heavily upon, among many other things, the degree and reliability of the physical information available in the examined material. For example, forensic firearms examinations, regardless of how they are carried out, involve the comparison of multiple items, each of which is an indirect and imperfect reflection of the patterns present in the surfaces of interest, while medical diagnostics often involve a direct analysis of a single biological sample collected under clinical conditions. The edited Figure 12.a from AL on p. 181 of the paper suggests that a value of kappa of 0.80, the border between McHugh's "moderate" and "strong" agreement, should have been required, but the authors' only justification for it is their opinion that anything less seems grounds for questioning the reliability of firearms comparisons.

2.2 Confusion of Statistical Independence with “Guessing”

The authors note on p. 181 that, provided P_E is not 1, kappa is zero when the expected and observed proportions are the same, corresponding to points on the 45-degree line in the AL report figures. They argue that our emphasis on the fact that observed agreement generally exceeds expected agreement is not particularly informative because "(e)xpected agreement turns out … to be the equivalent of the observed agreement of a blind observer, astutely guessing." In the appendix, they define "astute guessing" as a situation in which the examiner knows the relative proportion of right answers in the tests s/he will face. There are two reasons why this perspective is inappropriate for our application.

First, the examiners in our study did not have this information; they did not know the relative numbers of same-source and different-source sets they were asked to evaluate. Overall, the collections of sets distributed contained a ratio of approximately 1:2 same-source to different-source sets, but examiners did not know this. Further, this ratio varied from collection to collection of examination sets. That is, the information the examiners would have needed to make “astute guesses” as described by DV was not available to them.

More important, DV describe "astute guessing" for a framework in which expected agreement is computed for a mix of same-source and different-source evaluations. This is a setting in which kappa is indeed often invoked. But in contrast, all expected and observed proportions presented in AL are conditioned on ground-truth state. That is, there was no mix of same-source and different-source material leading to any of the estimates of expected or observed agreement. We maintained separate analyses for same-source and different-source comparisons because we did not want our calculations to reflect the artificial same-source-to-different-source ratio determined by the study design.

But for the sake of argument, suppose our examiners had assumed a 1-to-2 mix in the material they were asked to evaluate (again, information they didn't actually have) and had made independent "astute guesses" based on this assumption, as described by DV, on both sides of the study; that is, suppose they had said "ID" with probability 1/3 and "Exclusion" with probability 2/3 each time. Expected agreement would then have been (1/3)^2 + (2/3)^2 = 5/9 ≈ 0.556 for both same-source and different-source sub-studies. But look at DV's version of Figure 12.a; most of the P_E values plotted are clearly above this. Table XIII from AL (replicated in DV's Table 13) reports that in this case, the average value of P_E across examiners is 0.664. Similar observations can be made for the analogous figures for different-source sets and cartridge cases presented in AL. That is, estimates of ground-truth-specific expected agreement are well above the "astute guesses" offered by DV for an entirely different scenario, and the observed proportion of agreement seen in the study exceeded even this higher benchmark, often substantially, for almost all examiners. Note finally that DV's Figure 12.a is computed for same-source sets where all three inconclusive categories have been combined. Figure 11.a displays expected and observed frequencies for the same set of evaluations, but using the entire AFTE scale including the three inconclusive categories; here the average expected agreement, 0.637, is lower, reflecting "disagreements" involving two inconclusive categories, but it is still greater than the value of 0.556 available to hypothetical "astute guessers" who have inside information and who do not need to use the inconclusive categories at all.
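
As a check on the arithmetic in this paragraph, here is a short sketch; the 1/3 and 2/3 guessing probabilities are the hypothetical mix discussed above, and the 0.664 and 0.637 averages are the study figures quoted from AL.

```python
# Expected agreement for a hypothetical "astute guesser" who says "ID" with
# probability 1/3 and "Exclusion" with probability 2/3, independently each time.
p_id, p_excl = 1 / 3, 2 / 3
guesser_pe = p_id**2 + p_excl**2
print(round(guesser_pe, 3))  # 0.556

# Average ground-truth-specific expected agreement reported in AL for same-source sets:
# 0.664 with the inconclusive categories combined, 0.637 on the full AFTE scale.
print(guesser_pe < 0.637 < 0.664)  # True: both study averages exceed the guesser's value
```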

Our reported expected agreement is simply the estimated proportion of agreement one would see in the long term, conditioned on one kind of ground truth only, if examinations were statistically independent; this is something completely different from what is suggested by any phrase involving guessing. Suppose that in a particular situation, an examiner's error rate for a specific kind of material (i.e., ground truth) is 0.05, and that her judgements from call to call on the same material are independent. In this case, in two evaluations of the same evidence, she would "agree with herself" 0.05^2 + 0.95^2 = 0.905, or 90.5%, of the time; this is the value of both P_E and P_O for this examiner. But she's not "guessing" in any reasonable sense of the word; she's right 95% of the time in her evaluations of this kind of material, regardless of the mix of material she's asked to evaluate.
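
A two-line version of this worked example, under the same assumptions (a hypothetical 0.05 per-call error rate, responses collapsed to correct/incorrect, and independence between calls):

```python
# Expected self-agreement under statistical independence, conditioned on a single
# ground-truth state, for an examiner with a fixed per-call error rate.
def expected_self_agreement(error_rate: float) -> float:
    correct = 1.0 - error_rate
    return correct**2 + error_rate**2

print(round(expected_self_agreement(0.05), 3))  # 0.905, i.e., the 90.5% figure above
```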

2.3 Estimation of Kappa from Small Samples

Reliable estimates of kappa can be computed for very large samples and when P_E is not very close to 1, as with the hypothetical "mega-examiner" example on p. 180, where DV use the pooled data across the entire study to calculate proportions. As explained in AL, there is strong evidence that response probabilities vary across examiners, so calculations of kappa (either for ground-truth-specific evaluations or for the artificial mix determined by the study design) for the "mega-examiner" aren't meaningful except as a numerical demonstration. As DV note, estimates of expected and observed agreement proportions in AL were computed separately for each examiner (repeatability) and pair of examiners (reproducibility) for which data were available. Again, this was done separately for same-source and different-source examination sets. These are, individually, very small samples, of necessity, because each evaluation performed by a firearms examiner takes a substantial amount of time. As a result, the estimates of kappa that can be computed for individual examiners and examiner pairs are enormously unstable. In many cases they are undefined because the proportion of expected agreement is exactly 1 (and so the denominator of the fraction is zero). Evidence of this problem is visible in the upper right corner of DV's edited version of AL Figure 12.a on p. 181; note the heavy cluster of points where both observed and expected agreement proportions are 1. In the analysis that led to AL, we did calculate estimates of kappa for each examiner and examiner pair in those situations in which it is defined, and identified the (many) instances where it isn't. But because this statistic is so "noisy" for proportions based on small samples, even when the denominator isn't exactly zero, the results were substantially less helpful than the retention and comparison of the more informative P_O and P_E. Related to this, McHugh warns that even for substantially larger sample sizes, confidence intervals for kappa may be so wide as to be of limited value.
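
To see how badly this can behave, here is a small simulation sketch; the sample size, the 0.95 per-call accuracy, and the collapse to correct/incorrect responses are hypothetical choices for illustration, not features of the study design.

```python
import random

def simulated_examiner_kappa(rng, n_sets=10, p_correct=0.95):
    """One hypothetical examiner evaluates n_sets same-source sets in two rounds,
    answering correctly with probability p_correct, independently on each call;
    responses are collapsed to correct/incorrect for simplicity."""
    round1 = [rng.random() < p_correct for _ in range(n_sets)]
    round2 = [rng.random() < p_correct for _ in range(n_sets)]
    p_o = sum(a == b for a, b in zip(round1, round2)) / n_sets
    p1, p2 = sum(round1) / n_sets, sum(round2) / n_sets
    p_e = p1 * p2 + (1 - p1) * (1 - p2)  # chance agreement from the two marginal proportions
    if p_e == 1.0:
        return None                      # kappa undefined: zero denominator
    return (p_o - p_e) / (1 - p_e)

rng = random.Random(1)
print([simulated_examiner_kappa(rng) for _ in range(20)])
# typically a mix of widely varying values and None (undefined) entries
```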

3 Defensible Interpretation

So, what paired values of P_O and P_E, or for that matter of the arguably less informative kappa, do constitute "good repeatability" and "good reproducibility"? DV cite McHugh as suggesting that any answer should depend on the importance of the evaluation. This is certainly true, but it should clearly also depend on features that may be more difficult to characterize, such as the quantity and quality of the physical information available in the examined samples. DV maintain that the agreement proportions we present do not represent satisfactory agreement, but they base this conclusion on an argument involving inappropriate expected proportions and on values of kappa interpreted through rules of thumb that have no objective justification for this application.

In AL, we simply present paired examiner- and examiner-pair-specific expected and observed proportions of agreement for each ground-truth situation. What is clearly apparent in the data is a strong tendency toward better agreement than would be expected from independent evaluations by examiners. This appropriate expected agreement is, in turn, better than what would be achieved by DV's "astute guessers." I agree that this does not constitute as clean a quantification of agreement as we would all prefer. But characterizations that appear quantitative yet lack an objective basis are of limited value and potentially misleading. In the absence of better arguments, the fact that ground-truth-specific observed agreement consistently exceeds the corresponding appropriate expected agreement is an important indicator of repeatability and reproducibility.

References