1,095
Views
0
CrossRef citations to date
0
Altmetric
Research Article

Non-inferiority testing for qualitative microbiological methods:Assessing and improving the approach in USP <1223>

, & ORCID Icon
Pages 915-941 | Received 28 Sep 2020, Accepted 03 Apr 2022, Published online: 21 Sep 2022

ABSTRACT

The United States Pharmacopoeia (USP) presents two approaches for showing non-inferiority of an alternate qualitative microbiological method versus a compendial method. One approach compares the positive rates for the alternate and compendial methods at one spike level, while the other one compares multiple most probable number (MPN) estimates from a multi-spike design using a t-test. In this paper, we discuss these approaches under certain assumptions and propose a third approach that can be used for both single and multiple dilutions, which we call the generalized MPN (gMPN) approach. Simulations, using Poisson distributed numbers of microorganisms in test samples, confirm that the USP approach based on rates is not suitable, that the USP approach based on MPNs is appropriate for non-inferiority, but the gMPN approach outperforms the MPN-based approach and is therefore recommended.

1. Introduction

To show that a new or alternative qualitative microbiological method is acceptable to replace a current or compendial method, the USP <1223> (Citation2015) guideline states that the laboratory must demonstrate that the new procedure is as good as or better than the current procedure in terms of the ability to detect microorganisms. The USP recommends non-inferiority testing, for which it proposes two different approaches.

The first approach is based on the ratio of the proportions pA and pC of positive samples for the alternative and the compendial method at a single spike level, respectively, similar to non-inferiority of clinical events in clinical trials. A pre-defined non-inferiority margin quantifies what difference is allowed. The second approach is based on a comparison of the most probable numbers (MPN) using a design with multiple dilutions (Cochran Citation1950). Absence/presence results obtained from multiple dilutions are used to estimate one bacterial density of organisms in the original solution. This is done multiple times with the alternative and the compendial method to create repeated estimates of the bacterial density. Then non-inferiority is tested on two sets of MPNs using a t-test.

The USP does not provide guidance when to use which of the two non-inferiority approaches, neither discusses how to interpret the results from these approaches. Based on a statistical model for the detection of microorganisms (IJzerman-Boon and Van den Heuvel Citation2015), we will assess and improve the non-inferiority approaches.

Section 2 describes the two USP approaches, on positive rates and on MPNs, as well as their implicit assumptions. Section 3 describes the statistical detection model and proposes a third approach (the generalized MPN approach). Section 4 presents different experimental designs to determine MPN estimates and explains that under the described distributional assumptions, results for testing non-inferiority with the MPN are not expected to differ. Subsequently, Sections 5 and 6 describe our simulation study and present the results, respectively. Section 7 presents (a discussion on) our conclusions.

2. Non-inferiority testing according to USP <1223>

2.1. Approach 1: non-inferiority on positive rates

In order to show non-inferiority, it is required that both methods test similar sets of samples (see Section 2.3 for details). The null hypothesis is then formulated as H0: pA/pCr0 against the alternative hypothesis H1: pA/pC>r0, with r00,1 the non-inferiority margin. By proposing a non-inferiority margin of −0.2 for the difference pApC, the USP indirectly suggests a non-inferiority margin of r0=0.8 for the ratio. Following Farrington and Manning (Citation1990), the null hypothesis can be rewritten as H0: pAr0pC0, and is rejected when

(1) pˆAr0pˆC/wˆ01/2>z1α,(1)

with z1α the 1001  α% percentile of the standard normal distribution, pˆA and pˆC the standard estimates of the probabilities pA and pC to detect positive test samples, and wˆ0 the estimated variance of pˆAr0pˆC under the null hypothesis. Thus, when XA and XC are the numbers of positive samples among nA and nC test samples tested with the alternative and compendial method, respectively, we have pˆA=XA/nA, pˆC=XC/nC, and wˆ0 is

(2) wˆ0=p˜A1p˜A/nA+r02p˜C1p˜C/nC,(2)

with p˜A and p˜C the maximum likelihood estimators (MLE) under the null hypothesis, i.e. p˜C=p˜A/r0, p˜A=bb24ac1/2/2a, a=1+k, b=r01+kpˆC+k+pˆA, c=r0pˆA+kpˆC, and k=nC/nA. A 5% significance level is used, i.e. α=0.05.

Note that this approach is applicable to test samples from a single dilution, since a clear extension to multiple dilutions (with different probabilities for testing positives) is not known. The USP recommends a spike level of microorganisms for this dilution at which 50–75% of the samples would be expected to be positive when tested with the compendial method.

If, instead of testing independent samples from a single dilution with the alternative and compendial method, the same n samples are tested by both methods, then the results can be displayed in a 2 × 2 table (), and a paired test can be applied. In formula (1), we then use (Lachenbruch and Lynch Citation1998) pˆA=XA/n=X11+X10/n, pˆC=XC/n=X11+X01/n, resulting in pˆAr0pˆC=X10+1r0X11r0X01/n, and we replace the variance estimate (2) by

Table 1. Lay-out of results for a paired test – Numbers of positive and negative samples (associated probabilities).

(3) wˆ0=[pˆ101pˆ10+1r02pˆ111pˆ11+r02pˆ011pˆ01+2r0pˆ10pˆ0121r0pˆ10pˆ11+2r01r0pˆ01pˆ11]/n,(3)

where pˆij=Xij/n. Note that USP <1223> incorrectly suggests a variance of XAX10+X01/XC3, which is an estimate of the variance of pˆA/pˆC instead of  pˆAr0pˆC.

2.2. Approach 2: non-inferiority on MPNs

For the MPN-based approach, the null hypothesis H0: logλ Alogλ Clogr0 needs to be rejected in favor of H1: logλ Alogλ C>logr0, where λA and λC are the theoretical bacterial densities for the alternative and compendial method, respectively. Following Cochran (Citation1950), the probability that a sample is tested positively equals p=1exp(λ), when λ is the mean number of organisms in the test samples and the microbiological method detects organisms perfectly. Given an estimate pˆ0,1 for the proportion p, based on test samples with volume v taken from a single solution with volume V, the bacterial density and corresponding number of organisms in the solution are estimated by

(4) λˆ=log1pˆandMPNˆ=V/vλˆ.(4)

The MPN can also be estimated from a multiple dilution experiment, but then a closed-form expression does not exist.

Based on NA and NC MPN estimates or replicates for the two methods, respectively, a t-test for non-inferiority can be applied to the log-transformed estimates. Note that there are different ways of generating these MPN estimates (further discussed in Section 4), but when all MPNs are trying to estimate the same bacterial density and are estimated based on independent samples from the same stock solution (), then a two-sample t-test can be used. If YˉA and SA denote the average and standard deviation of the NA estimates for logλ A (or equivalently, for logMPNA) for the alternative method, and YˉC, SC, and NC for the compendial method, then non-inferiority can be concluded if

Figure 1. Visualization of conditional and marginal probabilities (7) and (8)

Figure 1. Visualization of conditional and marginal probabilities (7) and (8)

Figure 2. Experiments to generate multiple MPN estimates in a 3 × 3 design for 2 methods.

Figure 2. Experiments to generate multiple MPN estimates in a 3 × 3 design for 2 methods.
(5)  YˉA YˉCt1α,dfSA2NA+SC2NC >logr0,(5)

where t1α,df denotes the 1001  α% percentile of the t-distribution with the Satterthwaite degrees of freedom

df=SA2/NA+SC2/NC2SA2/NA2NA1+SC2/NC2NC1.

In case the MPNs for the two methods are estimated based on samples from the same dilutions (), even though in this case the two methods do not test the exact same samples, a paired t-test may be more appropriate. When Yˉ and S2 denote the sample mean and sample variance of the N paired differences Yˆ=logλˆAlogλˆC of the log-transformed MPN estimates for the alternative method minus the compendial method, then non-inferiority can be concluded if

(6) Yˉt1α,N1SN>logr0.(6)

2.3. Implicit assumptions in USP

In the first approach, the non-inferiority claim would only hold for the tested spike that resulted in the positive rates for which non-inferiority could be concluded. At that spike level, and under the assumption that both methods indeed received samples with similar spike levels, the two methods are likely to come to the same pass or fail conclusion. In the USP, this is referred to as decision equivalence. However, decision equivalence does not at all imply that the two microbiological methods have approximately the same sensitivity. To illustrate this, think of two microbiological methods, one detecting only molds, and one detecting only bacteria. When samples with mixtures of bacteria and molds are offered to both methods, the positive rates may be equivalent for the two methods. Nevertheless, it is obvious that the methods do not have the same sensitivity for both types of microorganisms and as a microbiologist you would never rely on just one method. Another situation where two methods might appear to be similar, is if you offer samples with a high spike. Then, even a poor test method will return only positive results. In these examples, it is clear that the number of positive samples does not only depend on the microbiological method, but also on the numbers of organisms in the test samples, and non-inferiority for one spike level (e.g. the level at which 50–75% of the samples would be positive with the compendial method) does not necessarily imply non-inferiority at other spike levels.

Another implicit assumption is that all samples have the same fixed probability (pA or pC depending on the test method), of becoming positive. This would be reasonable only when all samples contain the same number of organisms, but creating samples with a fixed number of organisms is currently impossible in microbiology. Even when all samples are taken from the same solution, numbers of organisms will vary from sample to sample because of sampling variability, and therefore some samples have lower probabilities of being detected positively than other samples.

In the second approach, the MPN method implicitly assumes that organisms are distributed randomly and unaggregated throughout the solution, such that the number of organisms in a small amount or test sample taken from it follows a Poisson distribution (Cochran Citation1950; Garthright and Blodgett Citation1996). Samples are assumed independent of each other (Garthright and Blodgett Citation1996), which means that the probability of a sample to be positive is not affected by other samples being positive or negative. This is practically true when the total volume of all samples constitutes a small volume of the total solution, say less than 10%.

Additionally, as Cochran (Citation1950) phrased it for growth-based methods: “each sample, when incubated in the culture medium, is certain to exhibit growth when it contains one or more microorganisms”. This would imply that every organism is detected by the method. But, if perfect detection is already part of our assumptions, what are we demonstrating when we show non-inferiority on MPNs?

3. Non-inferiority testing based on a statistical model for the detection of microorganisms

Since the number of positive test samples is not only determined by the method, but also by the unknown spike, we need an appropriate model that describes how a positive result is achieved and that can separate the spike level from the sensitivity of the method. To do this, we introduce two concepts the first concept describes the detection probability to detect one sample as positive as a function of the true number of organisms in the sample, the second concept considers a distribution for this unknown true number of organisms when we sample from a solution. Full knowledge of the first concept of the model would provide full insight into the sensitivity of the method for each possible number of organisms in the test sample. The detection probability, which should be an increasing function of the number of microorganisms X in the sample, may have parameters that would characterize the performance of the method.

3.1. The binomial-Poisson model

Denote the outcome of the test sample by Z, which would be 0 for a negative result, and 1 for a positive result. A simple model (IJzerman-Boon and Van den Heuvel Citation2015; Van den Heuvel and IJzerman-Boon Citation2013) assumes that each microorganism has a fixed probability θ (between 0 and 1) to be detected by the microbiological test method. This parameter has been called the detection proportion and represents the probability to detect one organism. The detection probability pX that the microbiological test would return a positive test result for a sample, i.e. detect at least one organism, can be expressed as a conditional probability given the number of organisms X in the sample,

(7) pX=P(Z=1|X)=11θX.(7)

This model has been referred to as the binomial detection model, since it is based on the binomial distribution when the method detects organisms independently from each other. Note that for X=1, the detection probability p1 reduces to the detection proportion θ.

Unfortunately, it is impossible to spike test samples with a fixed number of microorganisms. The spike exhibits variability, and therefore we cannot observe the function (7) directly. If the spikes X follow a Poisson distribution with a mean of λ, which is a common and often a reasonable distributional model for count data (Cochran Citation1950), then the marginal probability to detect a sample as positive equals the expected positive rate p and can be written as

(8) p=PZ=1=1expθλ.(8)

A graphical representation of (7) and (8) is provided in .

Based on an experiment, we can estimate the expected positive rate p from (8) as described earlier. However, the sensitivity of the method is reflected by the parameter θ, which cannot be separated from the average spike per test sample λ. We can only estimate the product ξ=θλ. For a single dilution, it is estimated by

(9) ξˆ=log1pˆ,(9)

and its corresponding variance would be estimated by τˆ2=pˆ/n1pˆ (IJzerman-Boon and Van den Heuvel Citation2015). Note that for θ=1, this estimator is just the MPN estimator for the bacterial density λ (Cochran Citation1950), presented in (4). Hence, our binomial detection model with θ<1 can be viewed as a generalization of the MPN and will be referred to as the generalized MPN.

3.2. Approach 3: non-inferiority on generalized MPNs

Observing that what we estimate is generally not the bacterial density itself, but its product with the detection proportion, it becomes clear that comparing the two test methods should be done by considering the ratio ξA/ξC, which equals the accuracy or recovery θA/θC, provided both methods tested samples from the same solution with an average spike of λ per test sample. Taking the ratio eliminates the spike level λ from the test statistic. Testing from the same solution is important, since otherwise the average spike λ does not cancel out when taking the ratio. This recovery or accuracy quantifies the relative performance of two qualitative methods and can be used to test non-inferiority of the alternative method compared to the compendial method (EP 5.1.6 Citation2017).

Approximate confidence limits for this ratio θA/θC or its logarithm logθA/θC have been derived in IJzerman-Boon and Van den Heuvel (Citation2015). Since in the log-scale a coverage is obtained that is closer to the nominal level, non-inferiority would be concluded if

(10) logξˆA/ξˆCz1ατˆC2/ξˆC2+τˆA2/ξˆA2>logr0,(10)

where z1α is the 1001  α% percentile of the standard normal distribution.

Expressions (9) and (10) hold for a single dilution. Like with the MPN method, also multiple dilution experiments can be used to estimate the (ratio of) detection proportions and corresponding confidence limits using maximum likelihood, but closed-form expressions do not exist anymore (IJzerman-Boon and Van den Heuvel Citation2015). Then, a statistical package like SAS® can be used to perform the calculations.

3.3. Comparing the gMPN with the USP non-inferiority methods

(solid lines) displays functions (7) and (8) for different values of the detection proportion θ, where θC=0.95 might reflect the compendial method and θA=0.7 the alternative method. shows the detection probabilities that we are interested in, since it provides the performance of the alternative and compendial method when they would test samples with a fixed number of microorganisms. When using a non-inferiority margin of r0= 0.8 in this example, the alternative method is obviously inferior compared to the compendial method, since the ratio θA/θC (black dot) of the two detection proportions for detecting samples with exactly one microorganism (X=1) is smaller than the non-inferiority margin r0 (reference line). Note that it is irrelevant that the ratio of the detection probabilities for samples with higher numbers of microorganisms (X>1) is larger than 0.8, since the methods are inferior for detecting one microorganism and thus inferior in detecting microorganisms. reflects the expected positive rates (pC for the compendial and pA for the alternative method) that we would observe in experimental data as a function of the average spike level. Since test samples will vary in their number of microorganisms in practice, they average out the detection probabilities in into the positive rates. also shows the ratio (dark dashed line) of these expected positive rates that is used to test non-inferiority in the first USP approach and the difference in positive rates (light dashed line). For λ=0, the ratio starts at θA/θC and then increases to 1 for higher spike levels. Thus, theoretically we would be able to demonstrate inferiority of the alternative method with respect to the compendial method using the ratio of positive rates when we would be able to create a dilution with an average number of microorganisms below, say, 0.7, since then the ratio of positive rates is less than 0.8 as well. However, there are a few practical concerns. First of all, spiking dilutions is (very) imprecise, and we may easily end up with an average spike (far) above 0.7, in which case we may declare the alternative method non-inferior (since the ratio of positive rates is then above the non-inferiority margin of 0.8). Secondly, the USP suggests spike levels for experimentation for which the positive rate of the compendial method is between 50% and 75%, i.e., spike levels that would result in non-inferiority since they provide ratios of the positive rates above the non-inferiority margin. Performing the experiment in this range would suggest the use of a difference in positive rates, since it does provide the largest absolute difference between the two methods, but the difference in positive rates (pCpA) would never be larger than 0.2 (an equivalent non-inferiority margin for differences when the ratio is 0.8 and the compendial method is close to perfect) for any of the spike levels (see the difference curve at the bottom of ). Thirdly, even if we would be able to create low spike levels, the ratio of the positive rates still provides a somewhat better result than the ratio of detection proportions θA/θC since it is larger than this ratio, unless we would spike very close to the level of blank samples, but then this would blow up the standard error of the estimated ratio and it would require very large numbers of samples. Finally, even if one would be willing to be less stringent and, instead of requiring non-inferiority on detecting a single organism, only require non-inferiority on the positive rates from a certain spike level onwards (probably at a more stringent non-inferiority margin), the uncertainty of the spike remains a problem, since it is impossible to estimate at which spike level the non-inferiority conclusion was drawn, nor does it tell anything about detection at lower spike levels or of just one organism.

Comparing formula (10) on generalized MPNs with formulas (5) and (6) on MPNs shows that both approaches actually evaluate the same thing, since the logarithm of a ratio equals the difference of the logarithms. Although the MPN approach implicitly assumes that the method is perfect, the similarity between (4) and (9) explains why a difference between MPNs, which would be an internal conflict with the assumption that two perfect methods are used to estimate the bacterial density of the same solution, may be attributed to or interpreted as a difference in detection between the two methods. The difference between the MPN approach and the generalized MPN approach is that in the latter approach all data are used to come up with one combined generalized MPN estimate for the ratio ξA/ξC and its standard error, while in the MPN approach, first multiple MPN estimates are generated and those are used as the data in the calculations.

4. Independent and paired experimental designs for the MPN approach

MPN experiments with multiple dilutions can be executed in different ways. One way is to generate all MPN estimates for both methods based on independent dilution series from the same stock solution (). This implies that the MPNs can be considered independent estimates for the same bacterial density and that the two-sample t-test for independent samples can be used. Instead of creating different dilution series per method, one could take samples for both methods from the same dilutions simultaneously (). This would prevent that potential pipetting errors in creating the dilutions would lead to different results between the methods. This experiment provides paired data at the level of the dilutions. There is no formal approach mentioned in the USP to properly address this pairing, but the paired MPN t-test, in the USP only suggested for the rare case where the exact same samples are tested with both methods, is applicable to this design as well.

An alternative for both cases, which may be more practical if testing cannot be performed on one day, would be that multiple stock solutions are used, from each of which one dilution series is created for the alternative method and one for the compendial method (). In this case, data are paired at the level of the stock solutions. The disadvantage of this paired design is that the bacterial densities will vary with stock solution due to variation in spiking. Nevertheless, the paired MPN t-test can be applied in this case.

Finally, one could use different stock solutions for the two methods, but this should be avoided, because differences observed between the two methods might then be caused by differences in the spikes λ for the two methods, rather than differences in detection by the methods.

In general, these different designs and ways of pairing samples may lead to different correlation structures between the samples. This might require different statistical analysis methods or, when analyzed using the same statistical methods ignoring correlations, could lead to different results. If, however, the number of organisms N in the stock solution follows a Poisson distribution and samples are generated from that solution using binomial or multinomial sampling, then the numbers of organisms in the individual samples also follow a Poisson distribution. In addition, independence between the samples can be proven in this case, even though the experimental design suggests dependent samples (Appendix 1). Under this assumption, it therefore does not matter whether the different samples are collected in an independent or dependent way. This simplifies simulations, since the different ways of pairing in MPN experiments () can be ignored, and the independent Poisson data generated to evaluate the independent two-sample t-test on MPNs, can also be used to evaluate the paired MPN t-test, after pairing MPN estimates randomly. Thus, under our Poisson assumptions, the MPN estimates are independent, and the paired t-test cannot be expected to gain power. To the contrary, due to a lower number of degrees of freedom used in the paired t-value, confidence intervals will get wider, leading to lower power for non-inferiority. Simulations will not be shown but are available on request.

5. Simulations

To compare the performance of the two USP approaches, based on the positive rate (referred to as USP1) and the (independent two sample) t-test on MPNs (referred to as USP2), with the generalized MPN approach (referred to as gMPN), simulations were performed for different parameter settings and designs. USP1 and gMPN were compared using designs with a single dilution with various spike levels (λ=0.5 to 3), from which 200 samples per method were taken (nA=nC=n=200). Multiple dilution designs were used to compare USP2 and gMPN. Typical MPN designs include 3 two-fold or ten-fold dilutions with 3 or 5 samples tested per dilution (USP <1223> Citation2015; De Man Citation1983; Garthright and Blodgett Citation2003). These designs are denoted by 3 × 3 or 3 × 5, and we repeated them 22 or 13 times to get a total sample size of almost 200. In order to see whether changing the number of samples per dilution would make a difference, we also evaluated designs with other numbers of test samples per dilution (3 × 4, 3 × 6, … , 3 × 33). The 3 two-fold or ten-fold dilutions were chosen at λ=4,2,1 and λ=20,2,0.2, i.e. such that the middle dilution would have a spike level of λ=2.

We assumed a detection proportion of θC=0.8 for the compendial method and evaluated the Type I error rate of (incorrectly) concluding non-inferiority at the non-inferiority margin of r0=0.8 (θA=0.64) proposed in the guideline and at a lower ratio of 0.7 (θA=0.56). For the evaluation of the power to (correctly) conclude non-inferiority, we assumed equal detection proportions θA=θC=0.8, a non-inferior and lower detection proportion (θA=0.9θC=0.72), and a superior detection proportion (θA=1.1θC=0.88) for both non-inferiority margins r0=0.8 and r0=0.7. For all parameter settings, 10,000 simulations were performed.

Assuming independence between the samples, simulation results were generated by drawing the true number of organisms in each sample from a Poisson distribution with mean λ for a single dilution, and λ divided by the appropriate dilution factor for samples from a dilution series. The detection probability for each sample was then calculated using formula (7), and the outcome of the sample was positive if this probability was larger than a random number between 0 and 1, and negative otherwise. The simulated data were analyzed using the different approaches USP1, USP2, and gMPN. SAS and R codes for such analyses are presented in Appendix 2. For the USP2 approach, MPN replicates sometimes failed. In those cases, the MPN for the alternative and/or the compendial method could not be estimated because all samples in all dilutions were positive or negative for that method. In simulations where this occurred, the t-test was calculated based on the remaining MPN replicates, as one would usually do in practice. For all three methods, the Type I error and power were estimated by the percentage of simulations for which non-inferiority was concluded. Failure rates were estimated by the percentage of MPN replicates for which all samples were positive (all samples negative did not occur) out of the total number of MPN replicates across all simulations. Note that the expected failure percentages can also be calculated theoretically based on formula (8). For example, for a 3 × k design with spike levels λ1, λ2, and λ3, this would be 100i=131expθλik.

6. Results

First, we compared USP1 and gMPN on their Type I error rates. shows the expected Type I error rates of around 5% for the gMPN approach, but unacceptably high Type I errors for USP1. These Type I errors increase rapidly with the spike level up to almost 100% when λ3, which means that the USP1 approach based on the positive rates almost always concludes non-inferiority, while in fact the alternative method detects one organism with a probability of only 80% of that of the compendial method, which is exactly equal to the selected non-inferiority margin. Also when the ratio of detection proportions is chosen below the non-inferiority margin of 0.8 (θA=0.7θC=0.56), USP1 still shows rapidly increasing probabilities beyond 5% when the spike level increases from 1.5 to above (Appendix 3, ), while the gMPN shows probabilities below nominal (as expected). The last two columns in show that when the ratio of positive rates equals the non-inferiority margin (which can only occur at one specific spike level), the Type I error of USP1 is around 5%. The ratio of detection proportions is then already far below the non-inferiority margin of 0.8 and the gMPN shows Type I errors below nominal.

Table 2. Type I error rate (%) to conclude non-inferiority with approaches USP1 and gMPN for a single dilution design using a non-inferiority margin of r0=0.8, detection proportion θC= 0.8, θA chosen such that θA/θC=r0 or pA/pC=r0, and n=200 samples.

shows for the same single dilution design the power for non-inferiority when the detection proportions of the alternate and compendial method are equal (θA=θC=0.8). With 200 samples per method, the powers for USP1 are very high (over 95% for spike levels of λ=1.5 or higher), but we know that this is at the cost of highly inflated Type I errors. For the gMPN method, the power using a non-inferiority margin of r0=0.8 attains a maximum value of about 57% for λ2. To increase the power, one should either test even more samples than n=200 per method or relax the non-inferiority margin. When a margin of r0=0.7 instead of r0=0.8 would be applied, then the power for the gMPN approach would be sufficient, with a maximum value of 89% for the optimal spike level, and still values above 80% when the average spike level is between 1 and 3. Please note that other values than 0.8 for the equal detection proportions would have led to similar results at slightly different spike levels, since results are driven by the product ξ=θλ in formula (8). Appendix 3, and C3 show results for the power when detection proportions differ, i.e. for θA,θC=0.72,0.8 or 0.88,0.8. Apart from the powers being lower or higher, respectively, than for θA,θC=0.8,0.8, the pattern is the same.

Table 3. Power (%) to conclude non-inferiority with approaches USP1 and gMPN for a single dilution design using a non-inferiority margin of r0=0.8 or r0=0.7, detection proportions θA,θC=0.8, 0.8 and n=200 samples.

To evaluate the second USP approach, which uses a t-test on a set of estimated MPNs, we simulated multiple dilution designs and compared the Type I error and the power of USP2 with that of the gMPN approach. shows that the Type I error rates for a detection proportion ratio at the non-inferiority margin of 0.8 (θA=r0θC=0.64) are all close to the nominal 5% level for the gMPN approach, while for the USP2 approach they are in most cases smaller and below 4.5%. Only for dilution factor 2 in the 3 × 3, 3 × 4 and 3 × 5 designs, the Type I error for the USP2 approach is somewhat inflated, probably because of the higher number of failed MPN replicates in the compendial than in the alternative group, leading to elimination of the MPN replicates for which the compendial method had all samples positive. All failure rates were in line with the theoretical expected value. For a ratio of detection proportions below the non-inferiority margin (θA=0.7θC=0.56), probabilities for both USP2 and gMPN are below the nominal 5% level, but the pattern is the same (Appendix 3, ).

Table 4. Type I error rate (%) to conclude non-inferiority with approaches USP2 and gMPN for multiple dilution designs using a non-inferiority margin of r0=0.8, detection proportions θA,θC=0.64, 0.8 and n~200 samples.

Results in show that the power for gMPN is stable across the different designs. For USP2 with dilution factor 2, the power seems to decrease with a decreasing number of replicates, while it is stable for dilution factor 10, until the number of replicates drops below 5. This may be due to a low number of degrees of freedom that is then used in the t-test. Note that for dilution factor 2, more failures occurred than for dilution factor 10, since the different dilutions are more likely to have only positives if they are closer together. The failure rate also decreases when the number of samples per dilution increases, since it becomes more difficult for a replicate to fail, i.e. to have all samples in all dilutions positive. Except for the 3 × 3 × 22 design with dilution factor 2, the power for USP2 is always smaller than for gMPN. A reason for USP2 having lower power than gMPN is most likely that USP2 compares multiple (depending on the number of replicates 22, 16, etc.) MPN estimates between the two methods but does not use the precision of the MPN estimates themselves, while the gMPN approach uses all data together without discarding or losing information.

Table 5. Power (%) to conclude non-inferiority with approaches USP2 and gMPN for multiple dilution designs using a non-inferiority margin of r0=0.8 or r0=0.7, detection proportions θA,θC=0.8,0.8 and n~200 samples.

Comparing the gMPN results of with shows that the power for a single dilution experiment with a close to optimal spike level is higher than for a multiple dilution experiment, which includes only part of the data at the optimal spike level. Diluting further away from the optimal spike level decreases the power for both methods, which is also clear when comparing dilution factor 10 with dilution factor 2. Also, designs with five dilutions (not shown) would have lower power than designs with three dilutions.

Using a non-inferiority margin of r0=0.7 instead of r0=0.8 would, with a dilution factor of 2, be sufficient to increase the power to values above 80% for gMPN and USP2 with the 3 × 3 × 22 design, but not for USP2 with the other designs with less replicates.

Appendix 3, and C6 show results for the power when detection proportions differ, i.e. for θA,θC=0.72,0.8 or 0.88,0.8. Apart from the powers being lower or higher, respectively, than for θA,θC=0.8,0.8, the pattern is the same.

7. Conclusions

This paper presented the two USP <1223> non-inferiority approaches (USP1 on positive rates, USP2 on MPNs) and a statistical model that helped to interpret these approaches and that led to a third approach (generalized MPNs). Simulations illustrated the performance of the three approaches. It can be concluded that USP1 on positive rates is not suitable, since it concludes non-inferiority too often when the alternate method is inferior in detecting a single organism. This becomes even more severe for higher spike levels (already at 2–3 CFU/sample, which is still far below the spike level of 10–50 CFU that USP suggests), for which the expected positive rates become closer to 100% and closer to each other. Obviously, for sufficiently high spike levels, also a poor method has no difficulty detecting positive samples. Hence, the positive rate is not a good measure to evaluate the method performance, since it is influenced not only by the method but also by the spike level, and the conclusion of non-inferiority would only hold for the spike level tested, which cannot be estimated. Even if one would consider non-inferiority on the positive rates from a certain spike level onwards sufficient, then the spiking uncertainty in microbiology and the fact that the power changes rapidly with the spike level, make it very difficult to set up an experiment that guarantees this. Due to this dependence on the spike level, there is also no easy way to choose a lower significance level to get the Type I error under control, in an attempt to still benefit from the increase in power.

On the other hand, USP2 is a suitable approach to evaluate the sensitivity of the alternative method in comparison with the compendial method and is not driven by the spike level. However, the proposed set-up for MPN can be improved. Instead of using multiple dilutions, it would be better to use the optimal single spike level that maximizes the power of the test. Under realistic scenarios with detection proportions above 0.7, the optimal spike level would be approximately 2 CFU/sample (for more details, see IJzerman-Boon and Van den Heuvel Citation2015; Strijbosch et al. Citation1990). If multiple dilutions are used, for example, to mitigate the risk of spiking uncertainty, then a much smaller dilution factor should be used in order to stay as close as possible to the optimal spike level. Moreover, it is recommended to use as many MPN replicates as possible, but 3 × 3 or 3 × 4 experiments are not recommended due to the high probability of failed MPN replicates, which could lead to a bias in the evaluation.

An even better alternative is to replace the t-test by the generalized MPN analysis, i.e. to use gMPN instead of USP2. The MPN approach ignores the variability of the individual MPN estimates, which is overcome in an analysis of all data simultaneously by the gMPN. Use of all data together generally increases the power and reduces the risk of failed MPN replicates in the MPN experiment. Furthermore, the sample size of 75–100 that USP <1223> suggests for 80–90% power should be increased to about 200, since lower sample sizes do not provide this power level, not even when the non-inferiority margin is 0.7. To mitigate this increase in sample size, one may consider analyzing the data of multiple organisms together, provided their ratios of detection proportions are homogeneous (Emampour et al. Citation2021).

In the simulations, we assumed that the true counts in the samples were independent Poisson. This is a theoretical assumption for the estimation of MPNs in USP2 as well as for the ratio of detection proportions θˆA/θˆC in gMPN. It is not a serious drawback, since in practice it may always be approximately achieved by making sure that the volumes taken for the dilutions are small compared to the stock solution, and that the test samples are small compared to the dilutions. However, if samples are not independent Poisson, then the different ways of executing MPN experiments () may play a role, and taking into account some pairing in the analysis, using the USP2 paired MPN t-test, may become better than the gMPN approach, which currently ignores any dependence between the samples. This requires further investigation. For a single dilution design, the gMPN approach may still be used, but it may be best to completely divide the stock solution over the test samples, because in that case, the estimator based on Poisson is robust against under- or overdispersion (Manju et al. Citation2019).

One of the limitations of the gMPN approach, which also applies to the USP methods, is that we did not consider false positives in our model. If false positives are ignored, then they may compensate for false negatives, and non-inferiority may be concluded while in fact the alternate method has a lower sensitivity than the compendial in combination with false positives. This may be easily resolved, since IJzerman-Boon and Van den Heuvel (Citation2015) presented a zero-deflated binomial detection model, which extends the binomial model (7) with a parameter for the false positive rate. Use of this extended model slightly changes formulas (8), (9), and (10), but would allow a clean comparison of the sensitivity of two methods without interference by false positives.

In conclusion, the generalized MPN approach clearly outperforms the two USP approaches with a clear interpretation under the binomial-Poisson model that we presented. Non-inferiority on the positive rates is strongly discouraged. Furthermore, the robustness of the MPN and gMPN approach against other detection probability models must still be investigated.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work is part of the research program Rapid Micro Statistics with project number 15990, which is (partly) financed by the Dutch Research Council (NWO).

References

  • Cochran, W. G. 1950. Estimation of bacterial density by means of the “most probable number”. Biometrics 6 (2):105–116. doi:10.2307/3001491.
  • De Man, J. C. 1983. MPN tables, corrected. European Journal of Applied Microbiology and Biotechnology 17 (5):301–305. doi:10.1007/BF00508025.
  • Emampour, M., P. C. IJzerman-Boon, M. A. Manju, and E. R. van den Heuvel. 2021. Optimal spiking experiment for non-inferiority of qualitative microbiological methods on accuracy with multiple microorganisms. Statistics in Biopharmaceutical Research. doi:10.1080/19466315.2021.2011397.
  • EP. 2017. 5.1.6. Alternative methods for control of microbiological quality. In European Pharmacopeia, Vol. 9.2, 4339–4348. Strasbourg: EDQM.
  • Farrington, C. P., and G. Manning. 1990. Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Statistics in Medicine 9 (12):1447–1454. doi:10.1002/sim.4780091208.
  • Garthright, W. E., and R. J. Blodgett. 1996. Confidence intervals for microbiological density using serial dilutions with MPN estimates. Biometrical Journal 38 (4):489–505. doi:10.1002/bimj.4710380415.
  • Garthright, W. E., and R. J. Blodgett. 2003. FDA’s preferred MPN methods for standard, large or unusual tests, with a spreadsheet. Food Microbiology 20 (4):439–445. doi:10.1016/S0740-0020(02)00144-2.
  • IJzerman-Boon, P. C., and E. R. van den Heuvel. 2015. Validation of qualitative microbiological test methods. Pharmaceutical Statistics 14 (2):120–128. doi:10.1002/pst.1663.
  • Lachenbruch, P. A., and C. J. Lynch. 1998. Assessing screening tests: Extensions of McNemar’s test. Statistics in Medicine 17 (19):2207–2217. doi:10.1002/(SICI)1097-0258(19981015)17:19<2207::AID-SIM920>3.0.CO;2-Y.
  • Manju, M. A., E. R. van den Heuvel, and P. C. IJzerman-Boon. 2019. A comparison of spiking experiments to estimate the detection proportion of qualitative microbiological methods. Journal of Biopharmaceutical Statistics 29 (1):30–55. doi:10.1080/10543406.2018.1452027.
  • Patil, G. P., and S. Bildikar. 1966. Identifiability of countable mixtures of discrete probability distributions using methods of infinite matrices. Proceedings of the Cambridge Philosophical Society 62 (3):485–494. doi:10.1017/S030500410004010X.
  • Strijbosch, L. W. G., R. J. M. M. Does, and W. Albers. 1990. Multiple-dose design and bias-reducing methods for limiting dilution assays. Statistica Neerlandica 44 (4):241–261. doi:10.1111/j.1467-9574.1990.tb01284.x.
  • USP. 2015. <1223> Validation of alternative microbiological methods. In United States Pharmacopoeia, Vol. USP40-NF35, 1756–1770. Rockville, MD: U.S. Pharmacopoeial Convention.
  • Van den Heuvel, E. R., and P. C. IJzerman-Boon. 2013. A comparison of test statistics for the recovery of rapid growth-based enumeration tests. Pharmaceutical Statistics 12 (5):291–299. doi:10.1002/pst.1581.

Appendix 1:

Proof that all samples are independent

Theorem

Suppose the number of organisms in the stock solution follows a Poisson distribution: N ~Poiλ. Assume that this stock solution is split into n equal samples with numbers of organisms X1,X2,,Xn. Conditional on the total number of organisms N in the stock solution, the numbers in the samples follow a multinomial distribution, with probabilities 1/n for each organism to end up in one of the samples: X1,X2,,Xn|N~MultN;1/n,,1/n.

Let Z1,Z2,,Zn0,1 denote the test outcome for each sample (0= negative, 1= positive). Assume that each sample is tested with a test method that detects organisms according to binomial detection model (7) with detection proportion θi, which may differ for each sample i=1,,n: PZi=1|Xi=11θiXi.

Then:

  1. the true numbers of organisms X1,X2,,Xn in the test samples follow an independent Poisson distribution, with marginal distribution: X1,X2,,Xn~Poiλ/n.

  2. the positive/negative test results Z1,Z2,,Zn are independent with Bernoulli probabilities: PZi=1=1expθiλ/n, or equivalently, PZi=0=expθiλ/n.

Proof

(a) This follows from Theorem 1 and its Corollary 4 in Section 4 (about mixtures on the total number (N) of the multinomial distribution) of Patil and Bildikar (Citation1966).

(b) In order to prove independence, i.e. that for all z1,,zn0,1
PZ1=z1,,Zn=zn=PZ1=z1PZn=zn,

it is sufficient to prove that:

(A1) PZ1=0,,Zn=0=PZ1=0PZn=0,(A1)

since the occurrence of 0 and 1 are complementary events, and for events A1,,An, independence of the events themselves also implies independence when one or more of the events is replaced by its complementary event, i.e.

PA1,,An=i=1nPAi implies PA1c,A2,,An=PA1ci=2nPAi,

since we can write

PA1c,A2,,An=PA1c|A2,,AnPA2,,An=
=1PA1|A2,,AnPA2,,An=
=1PA1,A2,,AnPA2,,AnPA2,,An=
=PA2,,AnPA1,A2,,An=
=1PA1i=2nPAi=
=PA1ci=2nPAi.

On the left-hand side of (A1), we have

PZ1=0,,Zn=0=
=N=0 x1=0x1++xn=NNxn=0NPZ1=0,,Zn=0|X1=x1,,Xn=xn,N=N
P(X1=x1,,Xn=xn|N=N)PN=N=

(Zi|X1,,Xn,N are conditionally independent therefore this is just the product)

=N=0x1=0Nxn=0Ni=1nPZi=0|X1=x1,,Xn=xn,N=N
P(X1=x1,,Xn=xn|N=N)PN=N=

(if we know Xi, then the other X’s do not matter, whether they are independent or not)

=N=0x1=0Nxn=0Ni=1nPZi=0|Xi=xiPX1=x1,,Xn=xn|N=NPN=N=
=N=0x1=0Nxn=0Ni=1n1θixiN!x1!xn!1nNλNeλN!=

(multinomial theorem a1++anN=x1++xn=NNx1,,xna1x1anxn)

=N=0nθ1θnN1nNλNeλN!=

(Poisson distribution Poi1i=1nθi/nλ adds up to 1 over all N)

=N=01i=1nθi/nλNeλ1i=1nθi/n+λ1i=1nθi/nλN!=
=1eλ1i=1nθi/nλ=eλi=1nθi/n.

On the right-hand side of (A1), we have for Z1 and similarly for Z2,,Zn

PZ1=0=N=0 x1=0 x1++xn=NNxn=0NPZ1=0|X1=x1,,Xn=xn,N=N
PX1=x1,,Xn=xn|N=NPN=N=

(Z1 only depends on X1)

=N=0x1=0Nxn=0NPZ1=0|X1=x1PX1=x1,,Xn=xn|N=NPN=N=
=N=0x1=0Nxn=0N1θ1x1N!x1!xn!1nNλNeλN!=
=N=0x1=0Nxn=0N1θ1x1N!x1!Nx1!Nx1!x2!xn!1nNλNeλN!=
=N=0x1=0N1θ1x1N!x1!Nx1!x2=0Nx1xn=0Nx1Nx1!x2!xn!1n1Nx1n1nNx11nx1λNeλN!=

(multinomial distribution MultinNx1;1n1,,1n1 adds up to 1)

=N=0x1=0NN!x1!Nx1!1n1nNx11θ1nx1λNeλN!=

(Newton’s binomium a+bN=x=0NNxaxbNx)

=N=01θ1n+n1nNλNeλN!=

(Poisson distribution Poi1θ1/nλ adds up to 1 over all N)

=N=01θ1/nλNeλ1θ1/n+λ1θ1/nλN!=eλθ1/n.

Hence, the left-hand side PZ1=0,,Zn=0=eλi=1nθi/nequals the right-hand side PZ1=0PZn=0=i=1neλθi/n, thereby completing the proof.

Appendix 2

Program codes for analysis

Appendix 3

Additional simulation results

Simulations supplementing those in (Type I error USP1 vs gMPN)

Simulations supplementing those in (Power USP1 vs gMPN)

Simulations supplementing those in (Type I error USP2 vs gMPN)

Simulations supplementing those in (Power USP2 vs gMPN)

Table C1. Type I error rate (%) to conclude non-inferiority with approaches USP1 and gMPN for a single dilution design using a non-inferiority margin of r0=0.8, detection proportion θC= 0.8, θA chosen such that θA/θC=0.7<r0, and n=200 samples.

Table C2. Power (%) to conclude non-inferiority with approaches USP1 and gMPN for a single dilution design using a non-inferiority margin of r0=0.8 or r0=0.7, θA/θC=0.9 with detection proportions θA,θC=0.72, 0.8 and n=200 samples.

Table C3. Power (%) to conclude non-inferiority with approaches USP1 and gMPN for a single dilution design using a non-inferiority margin of r0=0.8 or r0=0.7, θA/θC=1.1 with detection proportions θA,θC=0.88, 0.8 and n=200 samples.

Table C4. Type I error rate (%) to conclude non-inferiority with approaches USP2 and gMPN for multiple dilution designs using a non-inferiority margin of r0=0.8, detection proportions θA,θC=0.56, 0.8 and n200 samples.

Table C5. Power (%) to conclude non-inferiority with approaches USP2 and gMPN for multiple dilution designs using a non-inferiority margin of r0=0.8 or  0=0.7, θA/θC=0.9 with detection proportions θA,θC=0.72,0.8 and n200 samples.

Table C6. Power (%) to conclude non-inferiority with approaches USP2 and gMPN for multiple dilution designs using a non-inferiority margin of r0=0.8 or  0=0.7, θA/θC=1.1 with detection proportions θA,θC=0.88,0.8 and n200 samples.