Oncology

How accurate are the longer-term projections of overall survival for cancer immunotherapy for standard versus more flexible parametric extrapolation methods?

Pages 260-273 | Received 09 Sep 2021, Accepted 14 Jan 2022, Published online: 11 Feb 2022

Abstract

Aims

To assess the accuracy of standard parametric survival models, spline models, and mixture cure models (MCMs) fitted to overall survival (OS) data available at the time of submission in the NICE HTA process compared with data subsequently made available.

Methods

Standard parametric distributions, spline models, and MCMs were fitted to OS data presented in single technology appraisals (TAs) for immune-checkpoint inhibitors (ICIs) in cancer. For each TA, the estimated survival from the fitted models was compared with Kaplan–Meier (KM) data that were made available following the HTA submission using differences between point estimates and restricted area under the curve (AUC) at both the midpoint and the end of additional follow-up. Differences in interval AUC values (calculated for each 6-month period) were also assessed.

Results

Standard parametric survival models and spline models were more likely to underestimate longer-term survival, irrespective of the measure used to assess model accuracy. MCMs were more likely to overestimate survival; however, this was improved in some cases by applying an additional hazard of mortality for “statistically cured” patients.

Limitations

The accuracy of the models was assessed based on much shorter OS data than the period for which extrapolation is needed, which may impact conclusions regarding the most accurate models. The most recent TAs for ICIs have not been captured.

Conclusions

There are no definitive findings that unquestionably support the use of one specific extrapolation technique. Rather, each has the potential to provide accurate or inaccurate extrapolation to longer-term data in certain circumstances, but the added flexibility of more complex models can be justified for treatments, like ICIs, that have extended survival for patients across disease areas. The use of mortality adjustments for “statistically cured” patients allows decision-makers to explore more conservative scenarios in the face of high decision uncertainty.

Introduction

Uncertainty around lifetime survival projections based on short-term regulatory trial data is often, if not always, central to decision uncertainty in health technology assessment (HTA). The time between the initiation of new clinical trials, regulatory approval, and subsequent HTA submissions is becoming shorter. To accelerate patient access to innovative medicines identified as having promising benefits, regulatory approvals and submissions are often based on less mature data and surrogate outcomes or proxies for overall survival (OS) such as progression-free survivalCitation1. However, this leads to a reduction in the extent of long-term clinical trial evidence available to support HTA decisions. As explored by Quinn et al.Citation1, the issue is further exacerbated for novel treatment options including immune-checkpoint inhibitors (ICIs), a type of immuno-oncology therapy. The mechanism of action for these therapies can lead to a delayed, but lasting, clinical response due to the timing of the immune system response and thus an expectation of marked improvement in clinical outcomesCitation2. ICIs have been demonstrated to provide substantial improvements in the long-term survival of patients (the extent of which varies by technology and indication); however, these improvements are often not captured well when the available clinical trial data are less matureCitation3,Citation4.

Despite these data limitations, long-term estimates of time-to-event outcomes (including OS) are still required to support HTA decisions, given the request of HTA agencies such as the UK National Institute for Health and Care Excellence (NICE) to capture the full benefits (and costs) experienced by patients over their lifetimesCitation5. Thus, it is necessary to use methods that allow extrapolation beyond the trial period. The most common methods for extrapolating time-to-event outcomes are standard parametric survival models, the use of which is well documentedCitation6. However, these models often do not provide sufficient flexibility to reflect the long-term outcomes anticipated for ICIs. This issue has more recently led to the increased consideration in technology appraisals (TAs) of more flexible methods for long-term extrapolation including spline and mixture cure modeling methods and, subsequently, the production of new guidance covering more flexible extrapolation methodsCitation7. Nevertheless, the most appropriate approaches for accurate long-term extrapolation remain unclear.

Bullement et al.Citation8 assessed the accuracy of the extrapolation methods preferred by manufacturers, evidence review groups, and NICE committees for a set of NICE single technology appraisals in oncology when considering subsequent trial data cuts becoming available after the original submission. The review targeted TAs for immunotherapies in cancer indications. Using the TAs identified as part of this prior review and the associated evidence base, we assessed the ability of standard parametric survival models, mixture cure models, and spline models to accurately capture longer-term survival outcomes. For this, we fitted the models to the data used within each HTA submission and compared the generated outcomes with those observed in the longer follow-up survival data. Our aim was to identify any features that may a priori suggest whether some of these methods would be more appropriate for long-term extrapolation under specific circumstances.

Methods

The analyses presented in this paper are based on the systematic search of the NICE website conducted by Bullement et al.Citation8 to identify relevant TAs for immunotherapiesCitation9. This review focused on NICE because it is the most transparent source of HTA submission evidence and shares extensive information publicly. The details of this review have been presented in full previously, but in summary: the search was limited to completed single TAs published between 1 March 2000 and 31 December 2017. The US National Cancer Institute (NCI) definition of an ICI was used to identify all technologies licensedCitation10 by the European Medicines Agency (EMA) and US Food and Drug Administration (FDA) at the time. This resulted in the following list of cancer ICIs: ipilimumab, pembrolizumab, nivolumab, atezolizumab, durvalumab, and avelumab. To identify available OS data from more recent data cuts than reported in the TA materials, Bullement et al.Citation8 conducted a targeted literature search for the relevant pivotal trials associated with each TA. At least one later data cut was required for the inclusion of each TA to allow the accuracy of extrapolations to be assessed; otherwise, the TA was excluded. In addition, if the clinical trial associated with a TA had multiple published data cuts since the TA, the most recent one was used, to provide the most relevant long-term data (as applied by Bullement et al.).

For each TA, OS Kaplan–Meier (KM) plots were digitized for the ICI treatment arms (using digitization software GetData Graph Digitizer)Citation11 to create pseudo-patient-level data using the algorithm of GuyotCitation12. Pseudo-patient-level OS data were generated from the data cut of the pivotal trial that was presented during the NICE submission, and from the most recent data cut reporting OS KM data. Given changes in follow-up, new events observed and changes in censor times, the end of the earlier KM curve in particular can appear quite different to the shape of the later KM curve. Parametric survival analysis methods that allowed for extrapolation beyond the trial period were then applied to pseudo-patient-level data from the earlier data cut for each TA. Standard parametric models that have been generally recommended until recently were compared with two more flexible analysis methods, which are commonly used in cases where the data and clinical rationale suggest that long-term survival is plausible for a cohort of patients. The three analysis methods included:

  • Standard parametric survival models. These models are documented in detail in the NICE Decision Support Unit (DSU) Technical Support Document (TSD) 14 and are among the most common survival models used in NICE submissionsCitation6. Fitted distributions included the exponential, Weibull, Gompertz, log-logistic, log-normal, and generalized gamma, as well as the generalized F distribution, which allows additional flexibility with the inclusion of a fourth parameter. All standard parametric models were fitted in R using the flexsurv packageCitation13.

  • Spline modelsCitation14. These models reflect complex underlying hazards (and thus survival) within the follow-up period through the use of piecewise polynomials, such that the trajectory of the survival curve can change between selected time intervals, determined by the placement of knots. Spline models are considered an extension of a true piecewise model, where the changes between time intervals are smoothed. This flexibility can be used to better reflect changes in hazards that are not always possible to capture within standard parametric distributionsCitation14. However, the use of spline models in economic evaluations is limited, most likely due to the increased difficulty in interpreting these models and, importantly, in providing a clinical justification for their useCitation15,Citation16. Three spline-based models were considered: the proportional hazards, proportional odds, and the probit models. Each of these models a different transformation of the survival function. Splines were limited to a maximum of three (internal) knots to limit the risk of overfitting, particularly as it is considered unlikely that there is clinical justification to support high numbers of knots, and there seems to be a minimal gain from using more than three internal knotsCitation14. To remove any subjectivity of knot placement, knots were placed at equally spaced quantiles of the log uncensored survival time of the KM curve for each TA. In addition, the placement of knots is not considered a key driver of model fitCitation14. All splines in our study were fitted in R using the flexsurv packageCitation13.

  • Mixture cure models (MCMs). In general, mixture models are structured to reflect the assumption that there are two (or more) cohorts of patients in the population with differing survival outcomes, and the overall population contains a mixture of these two groupsCitation7. MCMs are a specific type of mixture model with two groups: one that is considered to reflect those who do not respond well to treatment and have poor outcomes (“uncured”) and one that reflects the outcomes of patients that are anticipated to experience long-term survival (“statistically cured”)Citation7,Citation17,Citation18. The mixture of patient groups is captured via the estimation of a “cure” fraction, reflecting the proportion of patients who experience long-term survival (“statistically cured”) versus those who do not. The fitted MCM estimates the proportion of patients who do experience long-term survival, as well as a survival model for short-term survivors. For short-term survivors, survival is modeled using the standard parametric survival models listed previously. The survival for long-term survivors is primarily modeled based on age- and gender-matched general population mortality applied from the time of entering into a study and throughout their lifetimeCitation19. In our study, all MCMs were fitted to absolute survival in R using the flexsurvcure package, guided by the approach presented by Grant et al.Citation20, with further details provided in Appendix 3.
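The structure of an MCM can be illustrated with a minimal Python sketch. This is not the fitting code used in the study (which relied on the flexsurvcure package in R); it only evaluates one common formulation of the survival function, in which "statistically cured" patients face background mortality alone and "uncured" patients face background mortality plus a disease-related hazard. The function names, the Weibull choice for the uncured group, and the constant background hazard are all assumptions made for illustration; in practice the background survival would come from age- and gender-matched life tables.

```python
import math


def weibull_survival(t, shape, scale):
    """Weibull survival function, S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def background_survival(t, annual_hazard):
    """Illustrative general-population survival with a constant hazard.

    Assumption for the sketch only: real analyses use life tables.
    """
    return math.exp(-annual_hazard * t)


def mixture_cure_survival(t, cure_fraction, shape, scale, annual_hazard):
    """Overall survival for a two-group mixture cure model.

    Cured patients are subject to background mortality only; uncured
    patients are subject to background mortality and the Weibull hazard.
    """
    s_background = background_survival(t, annual_hazard)
    s_uncured = weibull_survival(t, shape, scale)
    return s_background * (cure_fraction + (1.0 - cure_fraction) * s_uncured)


# Hypothetical parameter values, chosen only to show the long-run behaviour.
for years in (0.0, 1.0, 5.0, 20.0):
    print(years, round(mixture_cure_survival(years, 0.3, 1.2, 1.5, 0.02), 4))
```

With no background mortality the curve would level off at the cure fraction, which is why the trajectory of an MCM depends so strongly on how the cure fraction is estimated from the observed data.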

Because it was considered optimistic to assume that the survival of long-term survivors follows that of the general population, the application of excess mortality to long-term survivors was also considered when fitting MCMs. Published American Joint Committee on Cancer (AJCC) registry data were used to identify long-term, cancer-type-specific survival, where available and appropriate, to estimate the increased risk above general population mortalityCitation21. In each case, the most appropriate long-term survival curve(s) presented by the AJCC were selected to align most closely with the patient population of the primary trial evidence associated with each TA. Where the trial included patients at multiple disease stages, the separate survival curves by stage presented by the AJCC were digitized and combined. The increased hazard of mortality for long-term survivors was estimated by comparing age- and gender-matched general population survival estimated for each specific TA with the observed survival of long-term survivors in the indication. US mortality estimatesCitation22 were used for general population survival estimates to align with the AJCC data that were collected in the US. The increased hazard for long-term survivors was estimated by first rebasing the estimated general population survival and the registry data survival curve from the approximated point of inflection in the registry data, which suggests only long-term survivors remain alive at that time. Further details on the relevant AJCC data source selected for each TA and the estimated additional hazards of mortality are presented in the Appendix.
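The rebasing step can be sketched as follows. The exact calculation is given in the Appendix and is not reproduced here, so this Python example shows only one possible approach under stated assumptions: both curves are rebased to 1 at the inflection point, the excess risk is summarized as a single ratio of cumulative hazards (a proportional-hazards assumption), and that ratio is applied by raising general population survival to a power. All function names and the example numbers are hypothetical.

```python
import math


def rebase(survival, t0_index):
    """Rescale a survival curve so that it equals 1 at the chosen start point."""
    base = survival[t0_index]
    return [s / base for s in survival[t0_index:]]


def cumulative_hazard_ratio(registry, general_pop, t0_index):
    """Ratio of cumulative hazards (registry vs general population).

    Both curves are rebased at t0_index and compared at their last point.
    Assumes the two lists share the same time grid.
    """
    reg = rebase(registry, t0_index)
    gp = rebase(general_pop, t0_index)
    return math.log(reg[-1]) / math.log(gp[-1])


def apply_excess_hazard(general_pop, hazard_ratio):
    """Scale the general-population hazard: S_adj(t) = S_gp(t) ** hazard_ratio."""
    return [s ** hazard_ratio for s in general_pop]


# Hypothetical yearly curves: the registry hazard falls to three times the
# general-population hazard after an inflection at year 5.
years = range(11)
general_pop = [math.exp(-0.02 * t) for t in years]
registry = [
    math.exp(-0.3 * t) if t <= 5 else math.exp(-1.5 - 0.06 * (t - 5))
    for t in years
]
ratio = cumulative_hazard_ratio(registry, general_pop, 5)
print(round(ratio, 3))
```

A large ratio from such a calculation may indicate that the inflection point was placed too early, so that some short-term survivors are still contributing to the registry hazard.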

Following the application of the survival extrapolation approaches to the recreated pseudo-patient-level data from the initial data cut presented in the NICE submission, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) statistics were calculated. These goodness-of-fit statistics are commonly used in HTA submissions to support the identification of suitable model fits and were recommended for consideration in the NICE DSU TSD 14Citation6. However, AIC and BIC only capture the goodness-of-fit to the observed data, and the more recent NICE DSU TSD 21 cautions against the over-interpretation of these values when extrapolating beyond the study follow-upCitation7. The new guidance states that model choice should take into consideration the statistical appropriateness and clinical plausibility of survival estimates beyond the observed data for the fitted models. We assessed whether models with good statistical fit to the earlier data cut were aligned with models that provided the most accurate estimates for the longer-term data to understand how reliable goodness-of-fit statistics are when used to support model choice. An overall understanding of model fit was explored using plots that overlaid the models fitted to the initial KM curve on both the initial and later KM curves.
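For reference, AIC and BIC are simple functions of the maximized log-likelihood, the number of parameters, and the sample size. The Python sketch below computes both for an exponential model fitted to right-censored data, for which the maximum-likelihood rate has a closed form (events divided by total follow-up time). The data are invented, the function names are hypothetical, and the choice of the number of observations (rather than the number of events) as the sample size in BIC is one convention among several.

```python
import math


def exponential_fit(times, events):
    """Closed-form maximum-likelihood fit of an exponential survival model.

    times  : follow-up time for each patient
    events : 1 if the patient died at that time, 0 if censored
    Returns (rate, maximised log-likelihood).
    """
    n_events = sum(events)
    total_time = sum(times)
    rate = n_events / total_time
    log_lik = n_events * math.log(rate) - rate * total_time
    return rate, log_lik


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2 * log_lik


# Invented example data: four patients, one censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
rate, log_lik = exponential_fit(times, events)
print(rate, aic(log_lik, 1), bic(log_lik, 1, len(times)))
```

Both statistics depend only on the likelihood of the observed data, which is why they carry no direct information about the plausibility of the extrapolated portion of the curve.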

Multiple summary measures have been used to further quantify the accuracy of the model projections based on the earlier data cuts when compared with the observed OS KM curves from the later data cuts, including:

  • Comparison of the point estimates for survival between the model projection and the later OS KM data, summarized based on the difference in the point estimates (dPE; Figure 1).

  • Comparison of the estimated restricted area under the curve based on the fitted survival model with the later KM curve, summarized based on the difference between the restricted area under the curve (dRAUC; Figure 1).

Figure 1. Estimation of dPE, dRAUC and dIAUC. Abbreviations. dIAUC, difference between the interval area under the curve; dPE, difference in the point estimates; dRAUC, difference between the restricted area under the curve; FoU, follow-up; KM, Kaplan-Meier.

Both quantities were calculated at the midpoint between the end of follow-up of the original data and the end of follow-up of the more mature data. The use of the midpoint follows the rationale presented by Bullement et al.Citation8 for a pragmatic trade-off between robustness of the most mature KM data (sufficient patient numbers and increased follow-up time) and the relevance of addressing the research question (i.e. how accurately do the fitted models capture longer-term survival outcomes?). Comparisons of the model projections with the KM curves were also made at the maximum follow-up time of the later KM curve in scenario analyses. This approach used the maximum follow-up information available for each TA and closely aligned with the objectives of this investigation. However, this approach is likely to be more uncertain due to the low numbers at risk generally available at the end of a KM curve.

All differences were calculated based on the estimate from the fitted model minus the estimate from the observed KM curve, such that a positive difference would suggest that the model overestimated the observed KM survival when compared with later data and vice versa (Figure 1). Given the quantity of data, the differences in point estimates and AUC were ranked from the smallest to largest absolute values and the models that ranked among the three highest (i.e. the three smallest absolute values) were used to identify the most accurate model types for each TA.
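The two summary measures and the ranking rule can be sketched in Python as follows. This is an illustrative reconstruction and not the code used in the study: the KM curve is treated as a right-continuous step function, the model's restricted AUC is approximated with the trapezoidal rule, and all function names and example values are hypothetical.

```python
def km_survival_at(t, km_times, km_surv):
    """KM step function: survival stays constant between event times."""
    s = 1.0
    for time, surv in zip(km_times, km_surv):
        if time <= t:
            s = surv
        else:
            break
    return s


def km_restricted_auc(t_end, km_times, km_surv):
    """Exact area under the KM step function from 0 to t_end."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for time, surv in zip(km_times, km_surv):
        if time >= t_end:
            break
        area += prev_s * (time - prev_t)
        prev_t, prev_s = time, surv
    return area + prev_s * (t_end - prev_t)


def model_restricted_auc(t_end, survival_fn, steps=1000):
    """Trapezoidal approximation of the area under a fitted survival curve."""
    h = t_end / steps
    total = 0.5 * (survival_fn(0.0) + survival_fn(t_end))
    for i in range(1, steps):
        total += survival_fn(i * h)
    return total * h


def d_pe(t, survival_fn, km_times, km_surv):
    """Model minus KM point estimate; positive means overestimation."""
    return survival_fn(t) - km_survival_at(t, km_times, km_surv)


def d_rauc(t, survival_fn, km_times, km_surv):
    """Model minus KM restricted AUC up to t; positive means overestimation."""
    return (model_restricted_auc(t, survival_fn)
            - km_restricted_auc(t, km_times, km_surv))


def three_most_accurate(differences):
    """Names of the three models with the smallest absolute difference."""
    return sorted(differences, key=lambda name: abs(differences[name]))[:3]


# Invented later KM curve and a deliberately simple "model" for illustration.
km_times, km_surv = [1.0, 2.0, 3.0], [0.8, 0.6, 0.5]
flat_model = lambda t: 0.7
print(d_pe(2.5, flat_model, km_times, km_surv))
print(d_rauc(2.5, flat_model, km_times, km_surv))
```

Ranking on absolute values means that an overestimate and an underestimate of the same size are treated as equally inaccurate; the sign is retained separately to report the direction of the error.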

In addition, the area under the curve for the fitted survival model was compared with that for the later KM curve in each 6-month interval, summarized as the difference in 6-month interval AUC (dIAUC; Figure 1). Interval estimates are used to explore where deviations between the fitted model and the observed later KM curve occurred (i.e. whether the AUC was consistently over- or underestimated in each 6-month interval of the follow-up period).
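A minimal Python sketch of the interval calculation is given below, assuming time is measured in months, the KM curve is a right-continuous step function, and the model area is approximated with the trapezoidal rule. Function names and data are hypothetical and the code is illustrative only.

```python
def km_auc(a, b, km_times, km_surv):
    """Exact area under the KM step function between times a and b."""
    area, left, s = 0.0, a, 1.0
    for time, surv in zip(km_times, km_surv):
        if time <= a:
            s = surv          # survival level already reached at the start
            continue
        if time >= b:
            break
        area += s * (time - left)
        left, s = time, surv
    return area + s * (b - left)


def model_auc(a, b, survival_fn, steps=600):
    """Trapezoidal approximation of the model area between a and b."""
    h = (b - a) / steps
    total = 0.5 * (survival_fn(a) + survival_fn(b))
    for i in range(1, steps):
        total += survival_fn(a + i * h)
    return total * h


def d_iauc(t_end, survival_fn, km_times, km_surv, width=6.0):
    """Model minus KM area for each interval of the given width.

    Returns a list of (start, stop, difference); a positive difference
    means the model overestimated survival in that interval.
    """
    results, start = [], 0.0
    while start < t_end:
        stop = min(start + width, t_end)
        diff = (model_auc(start, stop, survival_fn)
                - km_auc(start, stop, km_times, km_surv))
        results.append((start, stop, diff))
        start = stop
    return results


# Invented KM curve (months) and a flat illustrative "model".
for start, stop, diff in d_iauc(12.0, lambda t: 0.8, [3.0, 9.0], [0.9, 0.7]):
    print(start, stop, round(diff, 3))
```

Plotting the differences against the interval start time gives the kind of trajectory described in the text: a sequence that drifts downward indicates growing underestimation, and one that drifts upward indicates growing overestimation.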

Results

The systematic search conducted by Bullement et al.Citation8 identified 11 relevant TAs for ICIs (details of this are published elsewhere). Of the 11 TAs identified, 10 were suitable for use in this investigation. One of the TAs (TA400) did not provide an original KM curve, and thus the accuracy of fitted models could not be assessed in this caseCitation23.

The sample sizes, follow-up times for the initial KM curves, and the overall proportion of models that underestimate survival were summarized by TA for the dPE and dRAUC measures at both the midpoint and end of follow-up (Table 1; further detail of dPE and dRAUC values is included in the Appendix). Additionally, to reflect model accuracy, the proportion of models within two standard errors of the KM estimate at both the midpoint and end of follow-up was also summarized for dPE.
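The "within two standard errors" check can be illustrated with a short Python sketch. The text does not state how the standard error of the KM estimate was obtained; Greenwood's formula is assumed here as the conventional choice, and the function names and data are hypothetical.

```python
import math


def km_with_greenwood(times, events):
    """Kaplan-Meier estimate with Greenwood standard errors.

    times  : follow-up time per patient
    events : 1 for death, 0 for censoring
    Returns a list of (time, survival, standard_error) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, var_sum = 1.0, 0.0
    out, i = [], 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Collect everyone who dies or is censored at this time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            if n_at_risk > deaths:
                var_sum += deaths / (n_at_risk * (n_at_risk - deaths))
            out.append((t, surv, surv * math.sqrt(var_sum)))
        n_at_risk -= leaving
    return out


def within_two_se(model_estimate, km_estimate, standard_error):
    """True if the model estimate lies within two SEs of the KM estimate."""
    return abs(model_estimate - km_estimate) <= 2.0 * standard_error


# Invented data: four patients, one censored at time 2.
curve = km_with_greenwood([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
time, km_est, se = curve[1]
print(time, km_est, round(se, 4), within_two_se(0.5, km_est, se))
```

Because the standard error grows as the number at risk falls, this criterion becomes more permissive towards the tail of a KM curve, which should be kept in mind when comparing the midpoint with the end of follow-up.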

Table 1. Model fit summary.

The proportion of models that underestimated survival is generally consistent between the two measures used (dPE and dRAUC). There was more variation in the proportion of models that underestimated survival between the two different time points investigated (at the midpoint and at the end of follow-up); however, if the majority of fitted models generally underestimated (or overestimated) at the midpoint, then the same trend was seen at the end of follow-up and vice versa. Standard parametric survival and spline models were more likely to underestimate survival irrespective of the measure used or timepoint, and MCMs were most likely to overestimate survival. In general, smaller trials with shorter follow-up periods tended to be associated with projections that overestimated survival when compared with subsequent data cuts. The proportion of models within two standard errors of the KM estimate generally decreases from the midpoint to the end of follow-up. In general, spline models were most likely to provide accurate estimates, based on estimates within two standard errors of the KM point estimate, and this is consistent at both time points. However, both standard parametric models and MCMs also provide accurate results based on this measure in certain cases and generally, across the TAs, the majority of models provide estimates within two standard errors of the KM point estimate.

For each 6-month interval, the difference in interval AUC values was calculated (Figure 2). The majority of fitted models led to a consistent or downward trajectory, suggesting a trend towards underestimating survival. There was a higher likelihood of model underestimation for those TAs with longer follow-up and larger patient numbers (TA319, TA417, and TA484) and a higher likelihood of overestimation for those TAs with shorter follow-up and lower patient numbers (TA447, TA483, and TA490). A majority of models underestimated survival both at the midpoint and end of follow-up; in total, seven of the TAs for standard parametric survival models, six of the TAs for splines, and four of the TAs for MCMs. Standard parametric survival models were most likely to provide extreme underestimations over time, while MCMs were most likely to lead to extreme overestimations. The MCMs that resulted in extreme overestimations were often also shown to provide poor accuracy based on dPE. For example, the MCM where a Gompertz distribution was used to model the survival for short-term survivors was consistently shown to lead to one of the most extreme overestimation trajectories (where the model increasingly overestimated survival over time). This aligned with the findings from Cooper et al.Citation24 where the MCM Gompertz was only shown to provide one of the three most accurate models for one TA, based on dPE at the midpoint of additional follow-up.

Figure 2. Six-month interval restricted mean survival (dIAUC) for each TA. Abbreviations. dIAUC, difference between the interval area under the curve; Exp, exponential; GF, Generalized F; GG, generalized gamma; Gom, Gompertz; LL, log-logistic; LN, log-normal; MCM, mixture cure model; PH, proportional hazards; PO, proportional odds; Pr, probit; Std., standard; PSM, parametric survival model; TA, technology appraisal; Weib, Weibull.

Expanding on the results presented previously by Cooper et al.,Citation24 spline models were least likely to provide an “extreme” over- or underestimate and led to dPEs that ranked within the lowest three dPE values in seven of the 10 included TAs, both at the midpoint and at the end of the additional follow-up. Standard parametric models were most likely to underestimate survival and led to the largest underestimates in nine of the 10 TAs considered. However, standard models also led to dPEs that ranked within the lowest three dPE values in four of the 10 TAs at the midpoint, increasing to seven at the end of the later KM follow-up. MCMs provided one or more of the three lowest dPEs for seven TAs calculated at the midpoint of the additional KM follow-up and for six TAs when the dPE was calculated at the end of the additional follow-up. However, MCMs were the most consistent model type to overestimate survival, and it was noted that this generally occurred for the trials with the lowest sample size and more immature follow-up for the initial data cut.

The three best- and worst-fitting models based on the dPE at the midpoint of the additional follow-up are explored further by TA in Figure 3. In five of the 10 TAs the three worst-fitting models underestimate survival and in three cases they overestimate survival. The MCMs are heavily influenced by any demonstration of a “plateau” in the initial KM, which influences the later trajectory for these models. Where evidence for the plateau is weakest (i.e. where the survival probability is still high when follow-up time is short or where the numbers at risk are consistently low towards the end of the KM), this leads to cases where MCMs provide the largest overestimates of longer-term survival. Similarly, in these cases it is more likely that the standard parametric models will lead to the greatest underestimates of longer-term survival. The best-fitting spline models are less likely to be influenced by the end of the initial KM than the MCMs. As already highlighted by Cooper et al.,Citation24 generally, the lowest AICs (indicating the best-fitting model) did not consistently align with the lowest dPEs in any of the TAs.

Figure 3. KM and best and worst fitting models overlaid for each TA. Abbreviations. Exp, exponential; GF, generalized F; GG, generalized gamma; Gom, Gompertz; KM, Kaplan-Meier; LL, log-logistic; LN, log-normal; MCM, mixture cure model; PH, proportional hazards; PO, proportional odds; Pr, probit; Std., standard; PSM, parametric survival model; TA, technology appraisal; Weib, Weibull.

Assessing model fit based on the difference in restricted AUC shows some consistency with the results based on dPE. MCMs were most likely to provide at least one of the three most accurate estimates (seven TAs at the midpoint and six at the end of the additional follow-up), followed by the standard parametric survival models (six and five at the midpoint and end of additional follow-up, respectively). Splines provided fewer of the most accurate models based on dRAUC values than based on dPE, although, unlike MCMs and standard parametric models, the accuracy of these models generally improved by the end of follow-up (three at the midpoint and six at the end of follow-up). This aligns with the finding that spline models were least likely to provide an “extreme” over- or underestimate towards the end of the follow-up and may be because spline models, unlike MCMs, carry no prespecified expectation about long-term performance.

In analyses presented previously by Cooper et al.Citation24, the accuracy of each fitted model was assessed based on dPE values at the primary midpoint timepoint. The proportion of models in our study that underestimated survival is presented in Table 1, indicating that MCMs were more likely to overestimate survival than both splines and standard parametric survival models. To reflect the possibility that long-term survivors may still be at an increased mortality risk compared with the general population, an excess hazard for these patients was applied.

The estimated additional hazards of mortality applied to the MCMs for each TA are presented in the Appendix. The estimated hazards are notably high in some cases, which could suggest that the point of inflection is too soon or that the source of long-term survival may not be appropriate. However, these estimates can provide an example of a possible extreme case in each scenario.

The application of an additional hazard of mortality reduced the proportion of MCMs that overestimated survival (see Table 2). In addition, despite the high additional mortality estimated in some cases, the application of an additional hazard for the proportion of patients estimated to be long-term survivors provides a similar or improved dPE (smaller absolute dPE at the midpoint) for at least one of the seven MCMs in all except one TA. In cases where the dPE at the midpoint is reduced following the application of the additional hazard (highlighted in green in Table 2), the average improvement (3.75%) is greater than the average worsening (2.59%) in cases where the dPE at the midpoint increased (red cells). This suggests that the application of an additional hazard for mortality can improve the estimate of the longer-term survival. Importantly, adjustment for additional mortality reduced model accuracy for some TAs with the largest sample sizes and follow-up, with the greatest benefits in accuracy shown for trials with the lowest sample sizes and shortest follow-up.

Table 2. Comparison of MCM dPEs at the midpoint when adjusted for general population mortality or with the general population subject to additional excess mortality.

Following this adjustment, the influence of any potential “plateau” or low number of patients at risk in the KM curve is also reduced, as the additional mortality hazard is applied to the proportion of patients who were estimated to be statistically cured. For example, the MCM Gompertz curve for TA447 (the TA with the second smallest patient population) overestimates survival at the midpoint by 7.90%; this is reduced to 3.10% when the additional hazard for mortality is applied. Similarly, for TA483 (the TA with the smallest patient population), the overestimation of survival at the midpoint for the Weibull, Gompertz, and generalized gamma curves is reduced by 55.10–59.20% for the models with additional hazard compared with those without. Conversely, for the most conservative models, such as the MCM exponential, the application of excess mortality is more likely to reduce the accuracy of the fitted model based on differences in point estimates.

Discussion

The development of novel immunotherapies in oncology has further increased the challenges surrounding the estimation of long-term survival due to the use of survival data with limited follow-up. Two of the three key challenges identified by Quinn et al.Citation1 for assessing the long-term benefits of immunotherapies are the differing mechanism of action and the resulting lack of agreement on the ideal methods for extrapolating survival for these treatments. These challenges have led to increased consideration of more flexible methods for extrapolation and new guidance for these methodsCitation7. Our analyses provide additional insight into the accuracy of standard parametric models compared with two more flexible methods, splines and MCMs, which are being more frequently considered to support HTA decisions to account for the impact of the differing mechanism of action for ICIs.

For 10 TAs identified in a previously conducted systematic search, we compared the accuracy of the three types of survival modelsCitation8. These were fitted to the data cut used to support the NICE HTA decision and compared with survival data that were later made available. Along with visual inspection of the curves, three different measures of accuracy were considered: dPEs for survival and dRAUCs (both calculated at the midpoint of the additional follow-up and the end of the additional follow-up), as well as differences in the interval AUC, calculated for each 6-month interval.

Generally, each of the measures favors the same model type/types; that is, the model types that are most accurate based on dPEs were generally the most accurate models based on other measures (AUC and interval AUC values). However, there was some variation in the exact models in each overall model type that were most accurate. As presented previouslyCitation24, AIC did not align well with the models that provided the closest point estimates based on dPE and this also holds for restricted AUC, suggesting that it is likely inappropriate to use measures of statistical fit, like AIC and BIC, as a basis for choosing models that represent clinically plausible projections in the longer-term.

Based on dPE and dRAUC at the midpoint and the end of additional follow-up, standard distribution models were most likely to underestimate longer-term survival. The MCMs generally produced the largest proportion of models that overestimated survival and standard parametric distributions generally produced the largest proportion of models that underestimated survival. This latter observation aligns with that of Ouwens et al.Citation25 who also found that standard parametric models generally underestimated longer-term OS compared with later data.

Spline models and MCMs were both more likely than standard parametric survival models to provide one of the most accurate models based on dPE (within the three smallest dPE values). This was also the case for MCMs based on dRAUC. The number of TAs in which spline models or standard parametric survival models provided one of the most accurate survival estimates (within the three smallest dPEs) was unchanged or increased when the dPE was estimated at the end of additional follow-up rather than at the midpoint. For MCMs, this number decreased by one TA at the end of follow-up. For dRAUC, both standard parametric survival models and MCMs provided fewer of the most accurate models at the end of additional follow-up than at the midpoint, whereas this number increased for spline models. The accuracy of spline models based on dPE and dRAUC (where accuracy in the latter increased from the midpoint to the end of follow-up) aligns with the findings of Gray et al., who found that spline models performed well at projecting OS when fitted to registry cohorts with artificially right-censored follow-up timeCitation26.

The 6-month interval AUCs were explored to understand how the accuracy of model extrapolations changed over time. Generally, the tendency for standard models to produce the greatest underestimations and for MCMs to produce the greatest overestimations became more pronounced over time. Importantly, the poorest trajectories (largest overestimations or underestimations) for differences in 6-month interval AUCs did not generally align with the specific MCMs and standard parametric survival models that provide the most accurate estimates.
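The interval comparison described above can be sketched as follows. This is a hypothetical illustration rather than the code used in these analyses: the reference curve is a smooth stand-in for a later KM curve (a real KM step function would be better integrated exactly), and the function names, curve shapes, time window, and sign convention (model minus reference) are all assumptions.

```python
import math

def interval_auc(surv_fn, start, end, steps=200):
    """Trapezoidal area under surv_fn between start and end."""
    h = (end - start) / steps
    total = 0.5 * (surv_fn(start) + surv_fn(end))
    total += sum(surv_fn(start + i * h) for i in range(1, steps))
    return total * h

def interval_auc_differences(model_fn, reference_fn, t_start, t_end, width=6.0):
    """Model minus reference AUC for each consecutive window of `width` months."""
    diffs = []
    lo = t_start
    while lo < t_end - 1e-9:
        hi = min(lo + width, t_end)
        diff = interval_auc(model_fn, lo, hi) - interval_auc(reference_fn, lo, hi)
        diffs.append((lo, hi, diff))
        lo = hi
    return diffs

# Hypothetical curves: an exponential fit versus a reference that levels off.
model = lambda t: math.exp(-0.035 * t)
reference = lambda t: 0.3 + 0.7 * math.exp(-0.06 * t)

# Assumed additional follow-up running from month 24 to month 48.
for lo, hi, diff in interval_auc_differences(model, reference, 24.0, 48.0):
    print(f"{lo:.0f}-{hi:.0f} months: {diff:+.3f}")
```

With these made-up curves, each successive 6-month difference is more negative than the last, which is the kind of growing underestimation over time that the interval measure is intended to reveal.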

As stated previously by Cooper et al.Citation24, it is difficult to conclude which models will generally under- or overestimate survival. However, the authors concluded that model overestimation was more likely with the smallest sample sizes and the most immature follow-up for the initial data cut, and this was consistent across dPE and dRAUC at both the midpoint and the end of additional follow-up. In addition, for TAs with a short initial follow-up, the pattern of under- or overestimation was unclear. However, in cases where sample sizes, initial follow-up time, and the numbers at risk towards the end of initial follow-up were lowest (particularly when the KM curve showed a “plateau” based on these low numbers at risk), we observed that models were more likely to overestimate. Specifically, the trajectories of the MCMs were likely to be most sensitive to these unsupported “plateaus”.

The application of an additional hazard of mortality for the proportion of patients who were estimated to be statistically “cured” in the MCM extrapolations reduced the proportion of MCMs that overestimated longer-term survival and led to greater improvement in accuracy on average. However, the TAs with the shortest follow-up and smallest sample sizes tended to benefit most from the application of the additional hazard, whereas the accuracy of survival estimates for TAs with the longest follow-up and largest sample sizes decreased in some cases. This observation is not unexpected, given that these trials are likely to better capture excess mortality in the observed trial outcomes. The consideration of external data for MCMs (and other more flexible models) is anticipated to be an essential requirement for the purposes of NICE HTA submissions based on the guidance presented in NICE TSD 21Citation7,Citation27. Given the possible impact on survival outcomes following the incorporation of external data, this study highlights the need to carefully consider and justify the selected source of the additional hazard of mortality (or other external data source), as well as the clinical plausibility of the selected models across the whole time period required for extrapolation.

Our assessment of model accuracy using a comparison of point estimates, AUC values, and interval AUC values, as well as a visual assessment of the curves, suggests that spline models and MCMs have the potential to provide accurate estimates of longer-term survival. Ouwens et al.Citation25 came to a similar conclusion for MCMs when investigating the accuracy of extrapolating OS for patients treated for non-small-cell lung cancer with durvalumab. Our assessment also suggests that all three model types can be appropriate options to estimate survival in certain TAs, again similar to Ouwens et al., who found it difficult to identify consistent uses of models to extrapolate survival data. Similarly, Roth et al.Citation28 found strong similarities in survival estimates between MCM and standard parametric survival extrapolations. The authors proposed that this was likely due to the lack of data maturity or the small proportion of long-term survivors and noted that in this situation, the use of MCMs may not be warranted. The clinical plausibility and rationale of the underlying assumptions of the applied methodology should, therefore, be carefully assessed when considering their use, particularly in the case of the more flexible spline models or MCMs.

We acknowledge that the presented analyses have some limitations. This research has been conducted based on a systematic search of the NICE website and the targeted literature search for later data cuts that were performed and presented by Bullement et al.Citation8 These were not updated for the purpose of these analyses. We acknowledge that a substantial number of additional TAs for ICIs are now available. The consideration of more recent ICI trial data with multiple data cuts (and of any more recent data cuts for the currently identified TAs) would allow the general trends identified in these analyses to be assessed and would be a valuable addition to this work.

We focused our comparison of model accuracy on standard parametric survival models, spline models, and MCMs; however, NICE TSD 21 guidance, published shortly after this work was conducted, includes these flexible model types as well as several other methods that have not been considered here. In addition, the assessment of model accuracy remains limited by the extent of additional follow-up provided for each TA, which is still considerably shorter than the length of the extrapolation required for HTA decision-making. For the time horizon estimates that are required to support economic model development, the clinical plausibility of extrapolations remains an essential consideration for any fitted model. Clinical plausibility and the use of external data, whether for validation or for incorporation into the survival extrapolation, are discussed in detail in the recently published NICE DSU TSD 21Citation7. This study focuses on spline and MCM methods, which both feature in the NICE guidance. However, the authors acknowledge that there is a range of other extrapolation methods, as well as methods for incorporating external data, that have not been considered here. In particular, spline models (or flexible parametric models) that incorporate external data as part of the model fitting have been shown to have the potential to provide accurate long-term extrapolationsCitation7. In addition, as described above, the assessment of model accuracy is based on three measures (dPEs, dRAUCs, and differences in the interval AUC calculated for each 6-month interval) along with visual inspection of the curves. Of note, dRAUC captures only the overall difference in RAUC and may therefore mask instances where over- and underestimates across the relevant time period cancel out; however, this measure is used in combination with other measures of accuracy that better capture this detail. The authors recognize that there are likely many alternative measures (such as differences in mean OS, or the proportion of time where models more closely reflect the KM curve) that could also be considered as part of future research.
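The cancellation noted above for dRAUC can be written out explicitly. The notation is an assumption introduced here for illustration and is not taken from the analyses: $\hat{S}(t)$ denotes a fitted model, $S_{\mathrm{KM}}(t)$ the later KM curve, $T$ the restriction time, $0 = t_0 < t_1 < \dots < t_K = T$ the interval boundaries, and $d_k$ the signed interval AUC difference for interval $k$. Assuming RAUC is integrated from time zero and differences are taken as model minus KM, dRAUC is the sum of the signed interval differences:

```latex
\mathrm{dRAUC}(T)
  = \int_{0}^{T} \bigl[\hat{S}(t) - S_{\mathrm{KM}}(t)\bigr]\,dt
  = \sum_{k=1}^{K} \int_{t_{k-1}}^{t_k} \bigl[\hat{S}(t) - S_{\mathrm{KM}}(t)\bigr]\,dt
  = \sum_{k=1}^{K} d_k
```

As a purely hypothetical example with $K = 2$, interval differences of $d_1 = +0.5$ and $d_2 = -0.5$ months give $\mathrm{dRAUC}(T) = 0$ even though $\sum_k |d_k| = 1.0$ months. A model can therefore appear accurate on dRAUC while diverging from the KM curve in each interval, which is why the interval AUC differences are a useful complement.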

The general population adjustment (with or without excess mortality) applied to the long-term surviving cohort of the MCMs for the purpose of extrapolation was made after the model was fitted. Incorporating external data into the model fitting process may influence the estimated cure fraction and the subsequent model parameters for the short-term survivor cohort, and may therefore affect these conclusions. However, our analyses demonstrate that the use of MCMs can be a viable option in some cases and can offer greater flexibility for cost-effectiveness model requirements (such as exploring the impact of different mortality adjustments).

Finally, the estimation of excess mortality for the MCMs was based on a review of the AJCC registry information; this information was selected based on the disease indication and stage most comparable to the population in each TA, but the selection was limited to what was available in the AJCC. In particular, several specific features of the TA populations could not be captured using the AJCC registry data, including any differences in long-term outcomes between treated and untreated populations and the inclusion or exclusion of patients with mutations, brain metastases, and squamous/non-squamous disease.

Conclusions

  • The maturity of the initial follow-up data alone did not provide a clear indication of whether models were likely to under- or overestimate survival beyond the observed period. However, the potential to overpredict survival beyond the end of follow-up where data are most immature should be carefully considered.

  • Across the range of measures used to assess the accuracy of model fit, the more flexible spline and MCM models were generally demonstrated to have the potential to accurately reflect longer-term survival, but all models were shown to provide viable options for longer-term estimation of survival in certain cases. However, over- and under-estimations were still observed in a substantive number of cases with all approaches and so the rationale behind the use of any model should be carefully justified based on clinical plausibility and the relevant trial data.

  • The addition of excess mortality to the general population for MCMs may be a viable clinically plausible way to reduce the likelihood of MCMs (and other models) overestimating longer-term survival; however, identification of the source of excess hazard is subjective and should be carefully justified. The benefits in accuracy from this process are likely to be greatest in scenarios where trial follow-up is shortest and sample sizes smallest, and these aspects are useful to consider when making decisions on how best to estimate long-term outcomes.

  • Extrapolation to support HTA decision-making often requires lifetime projection, which is much longer than the additional follow-up available for these analyses. The clinical plausibility of the extrapolations across the whole time period for extrapolation should be carefully considered and justified when selecting models, particularly in cases where most data are immature.

Transparency

Declaration of funding

This study and manuscript were supported by funding from Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA, and Merck, Sharp & Dohme Limited, Hoddesdon, UK.

Declaration of financial/other interests

MC, SS, and TW are employees of BresMed. BresMed received consultancy fees from Merck Sharp & Dohme Ltd for the development of this research and the writing of this manuscript. The authors did not receive direct payment as a result of this work outside of their normal salary payments. RAI is an employee of Merck Canada Inc., a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA.

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Previous presentations

Earlier analyses of this study were presented as a poster (PCN293) at Virtual ISPOR US 2020.

Acknowledgements

We thank Tara Harding and Ash Bullement for their input during the development of this research, Will Sullivan for his support and input during the development of this manuscript, and Jake Horgan for his editorial support during the development of this manuscript.

References

Appendix 1.

dPE and dRAUC values at both the midpoint and end of follow-up for the three most accurate fitting models

Table 3. dPE and dRAUC values at both the midpoint and end of follow-up for the top three most accurate fitting models.

Appendix 2.

dPE and dRAUC values at both the midpoint and end of follow-up for the three least accurate fitting models

Table 4. dPE and dRAUC values at both the midpoint and end of follow-up for the top three least accurate fitting models.

Appendix 3.

Mixture cure models and additional mortality

The overall survival for the “mixture” of short- and long-term survivors was calculated using the survival for the constituent parts in combination with the proportion of patients, estimated by the “cure” fraction, that follow each survival trajectoryCitation20. The estimated proportion of long-term survivors (the “cure” fraction) was assumed to follow the derived general population mortality, and the estimated proportion of short-term survivors (1 – the “cure” fraction) followed survival estimated by the fitted parametric survival model. When incorporating additional mortality for long-term survivors, the estimated additional hazard was applied to the general population survival before following the same approach set out above.
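The calculation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in these analyses: the function and parameter names are hypothetical, the curves and values are made up, and applying the additional hazard as an exponent on general population survival assumes a proportional hazards relationship (that is, a constant hazard ratio for long-term survivors relative to the general population).

```python
import math

def mcm_survival(t, cure_fraction, gp_survival, uncured_survival, excess_hr=1.0):
    """Overall survival for a mixture of long- and short-term survivors.

    cure_fraction    : estimated proportion of long-term survivors
    gp_survival      : general population survival function S_gp(t)
    uncured_survival : fitted parametric survival for short-term survivors
    excess_hr        : assumed hazard ratio for long-term survivors versus the
                       general population (1.0 means no additional mortality)
    """
    if not 0.0 <= cure_fraction <= 1.0:
        raise ValueError("cure_fraction must lie between 0 and 1")
    # Under proportional hazards, multiplying the hazard by HR raises S to the power HR.
    long_term = gp_survival(t) ** excess_hr
    return cure_fraction * long_term + (1.0 - cure_fraction) * uncured_survival(t)

# Hypothetical inputs (time in months); none of these values come from the TAs.
gp = lambda t: math.exp(-0.002 * t)               # stand-in general population survival
uncured = lambda t: math.exp(-((t / 12.0) ** 1.2))  # stand-in Weibull for short-term survivors

for t in (12.0, 60.0, 120.0):
    base = mcm_survival(t, 0.25, gp, uncured)
    adjusted = mcm_survival(t, 0.25, gp, uncured, excess_hr=2.0)
    print(f"t={t:>5.0f}  no excess: {base:.3f}  with excess hazard: {adjusted:.3f}")
```

Because the adjustment acts only on the long-term survivor component, its effect on overall survival grows with time and with the size of the estimated cure fraction.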

Table 5 presents details on the data used to derive additional hazards of mortality for long-term survivors when fitting MCMs. The additional hazard for long-term survivors was estimated by comparing AJCC registry data for specific tumor types with the age- and gender-matched general population survival estimates from each TA. Two sets of pseudo patient-level data were first generated for each TA: one from digitized AJCC data for the tumor type considered in that TA (described in Table 5), and the other from the age- and gender-matched general population survival specific to that TA.

To estimate the excess mortality for long-term survivors, the AJCC survival for the relevant tumor type was compared with the general population survival from the time at which patients remaining alive in the AJCC population were considered long-term survivors, that is, when survival for these patients appeared to level out and reached an inflection point. This point of inflection was approximated by visual inspection of the AJCC curves to identify the timepoint at which the survival curve began to flatten. Simple hazard plots were also inspected to support the selection of the point of inflection. Once an approximate inflection point was determined (with the selected excess mortality source data timepoint shown in Table 5), both sets of pseudo patient-level data were rebased to exclude data with any time value occurring before this point. Cox proportional hazards models were then fitted to the rebased pseudo patient-level data, resulting in estimated excess hazard ratios for long-term survivors compared with the age- and gender-matched general population survival from each TA, which could then be applied to long-term survivors in the MCMs.
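The rebasing step can be illustrated with a small sketch. Note that this does not reproduce the Cox proportional hazards models described above: to keep the example self-contained, it substitutes a crude ratio of event rates (events divided by person-time, which assumes constant hazards after the inflection point). Resetting the clock to the inflection point is also an assumption about what "rebased" means, and all names and data are hypothetical.

```python
def rebase(records, landmark):
    """Drop (time, event) records before the landmark and reset the clock to it."""
    return [(t - landmark, e) for t, e in records if t >= landmark]

def crude_hazard(records):
    """Events per unit of person-time (a constant-hazard simplification)."""
    events = sum(e for _, e in records)
    person_time = sum(t for t, _ in records)
    if person_time <= 0:
        raise ValueError("no person-time at risk after rebasing")
    return events / person_time

def crude_excess_hazard_ratio(registry, general_pop, landmark):
    """Rate ratio after the landmark; a stand-in for a Cox model hazard ratio."""
    return crude_hazard(rebase(registry, landmark)) / crude_hazard(rebase(general_pop, landmark))

# Hypothetical pseudo patient-level data: (time in years, 1 = death, 0 = censored).
registry = [(2, 1), (4, 1), (8, 1), (10, 0), (12, 1), (15, 0)]
general_pop = [(3, 0), (6, 0), (9, 1), (14, 0), (15, 0), (15, 0)]

# Hypothetical inflection point at 5 years.
print(round(crude_excess_hazard_ratio(registry, general_pop, landmark=5), 2))
```

In practice, a Cox model fitted to the two rebased data sets with a group indicator as the only covariate would estimate the hazard ratio without the constant-hazard assumption made here.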

Table 5. Additional mortality for mixture cure models.