Original Research

An evaluation of exact matching and propensity score methods as applied in a comparative effectiveness study of inhaled corticosteroids in asthma

Pages 15-30 | Published online: 22 Mar 2017
 

Abstract

Background

Cohort matching and regression modeling are used in observational studies to control for confounding factors when estimating treatment effects. Our objective was to evaluate exact matching and propensity score methods by applying them in a 1-year pre–post historical database study to investigate asthma-related outcomes by treatment.

Methods

We drew on longitudinal medical record data in the PHARMO database for asthma patients prescribed the treatments to be compared (ciclesonide and fine-particle inhaled corticosteroid [ICS]). Propensity score methods that we evaluated were propensity score matching (PSM) using two different algorithms, the inverse probability of treatment weighting (IPTW), covariate adjustment using the propensity score, and propensity score stratification. We defined balance, using standardized differences, as differences of <10% between cohorts.

Results

Of 4064 eligible patients, 1382 (34%) were prescribed ciclesonide and 2682 (66%) fine-particle ICS. The IPTW and propensity score-based methods retained more patients (96%–100%) than exact matching (90%); exact matching selected less severe patients. Standardized differences were >10% for four variables in the exact-matched dataset and <10% for both PSM algorithms and the weighted pseudo-dataset used in the IPTW method. With all methods, ciclesonide was associated with better 1-year asthma-related outcomes, at one-third the prescribed dose, than fine-particle ICS; results varied slightly by method, but direction and statistical significance remained the same.

Conclusion

We found that each method has its particular strengths, and we recommend at least two methods be applied for each matched cohort study to evaluate the robustness of the findings. Balance diagnostics should be applied with all methods to check the balance of confounders between treatment cohorts. If exact matching is used, the calculation of a propensity score could be useful to identify variables that require balancing, thereby informing the choice of matching criteria together with clinical considerations.

Supplementary materials

Additional methods

Methods of matching and causal analysis

We evaluated exact matching, propensity score matching (PSM), the inverse probability of treatment weighting (IPTW), covariate adjustment using the propensity score, and propensity score stratification. The IPTW, covariate adjustment, and stratification methods differ from PSM in that they retain the full dataset (so no biases are introduced through patient selection) but use the propensity score in other ways to achieve balance (ie, not just for matching patients).

Propensity score matching

For PSM, patients are matched on a single variable (the estimated propensity score, or its logit) within a predefined caliper, usually at a 1:1 matching ratio, although other ratios can be considered as appropriate to the data. Because the propensity score is estimated from a regression model, its precision depends on which potential confounders are included in that model; the true propensity score is not known. As a consequence, residual confounding can persist even after propensity score approaches are applied.
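To illustrate matching on the logit of the propensity score within a caliper, the following is a minimal sketch of a generic greedy 1:1 nearest-neighbor match without replacement. It is not the RiRL algorithm or the specific greedy algorithm used in the study, whose details are not given here. The function names, the patient identifiers, the example scores, and the caliper width of 0.2 on the logit scale are all hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def greedy_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping a patient id to an estimated
    propensity score. caliper: maximum allowed absolute distance on
    the logit scale (illustrative value, not taken from the study).
    Returns a list of (treated_id, control_id) pairs.
    """
    available = {cid: logit(ps) for cid, ps in control.items()}
    pairs = []
    for tid, ps in treated.items():
        if not available:
            break
        target = logit(ps)
        # Closest remaining control on the logit scale.
        best = min(available, key=lambda cid: abs(available[cid] - target))
        if abs(available[best] - target) <= caliper:
            pairs.append((tid, best))
            del available[best]  # each control is used at most once
    return pairs


# Hypothetical scores for three treated and four control patients.
treated = {"t1": 0.30, "t2": 0.60, "t3": 0.95}
control = {"c1": 0.32, "c2": 0.58, "c3": 0.50, "c4": 0.10}
pairs = greedy_match(treated, control, caliper=0.2)
print(pairs)
```

In this toy example the third treated patient has no control within the caliper and is dropped, which is how matching can reduce the retained sample relative to the full dataset.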

After applying PSM, we therefore conducted a balance assessment by repeating the baseline analysis, both to confirm that balance between cohorts had been obtained and to test whether the propensity score model was adequately specified. We respecified the propensity score model by adding more variables (based on previous research experience), interactions, and non-linear terms until appropriate balance was obtained. Balance between cohorts was evaluated in two ways: by comparing summary statistics of baseline variables using P values from conditional logistic regression, with significance set at P<0.05; and by using standardized differences to compare mean values and prevalence of baseline variables, with balance considered achieved for differences lying within a 10% window. Standardized differences were calculated using a SAS macro developed by Yang and Dalton and available via the website of the Lerner Research Institute.[1]
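As an illustration of the balance criterion, the sketch below computes standardized differences under the commonly used definition: the difference in means or proportions divided by the pooled standard deviation. This definition is assumed here; the exact formulas implemented in the SAS macro are not reproduced in this document, and the function names are hypothetical. The binary example uses the 41% vs 34% prevalence of evidence of GERD reported for the exact-matched cohorts in the additional results.

```python
import math
from statistics import mean, variance


def std_diff_continuous(treated, control):
    """Standardized difference for a continuous baseline variable."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def std_diff_binary(p_treated, p_control):
    """Standardized difference for a binary variable, from prevalences."""
    pooled_sd = math.sqrt(
        (p_treated * (1 - p_treated) + p_control * (1 - p_control)) / 2
    )
    return (p_treated - p_control) / pooled_sd


def is_balanced(d, threshold=0.10):
    """Balance rule used in the text: absolute difference below 10%."""
    return abs(d) < threshold


# Evidence of GERD after exact matching: 41% vs 34%.
gerd = std_diff_binary(0.41, 0.34)
print(round(gerd, 3), is_balanced(gerd))
```

Under this definition the GERD difference is roughly 0.145, which is consistent with the statement that standardized differences for this variable exceeded 10% after exact matching.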

Additional results

Exact matching

Exact matching retained the fewest patients (2488) and so was the lowest powered and least likely to be representative of the full population. Indeed, patients in the ciclesonide cohort selected for matching were marginally less severe than the overall unmatched population. Adjustment for residual confounders after matching made only modest differences (particularly in the analysis of overall asthma control, for which adjustments in some other methods made quite large differences), suggesting that the matching was effective in reducing confounding. All models were adjusted for evidence of gastroesophageal reflux disease (GERD). This was not a matching variable, and significant differences (41% vs 34% in the ciclesonide vs fine-particle cohorts) remained at baseline after matching; standardized differences were in excess of 10%. Calculation of the propensity score showed evidence of GERD to be a strong predictor of treatment allocation, so balance might have been improved by using the propensity score to inform the choice of exact matching criteria. It would have been interesting to repeat the exact matching process, matching also on evidence of GERD, although the gain in balance across treatment arms would need to be weighed against a further loss in sample size and therefore power.

Propensity score matching

A list of 12 covariates to use for the propensity score estimation was identified after excluding 7 collinear variables and 3 variables not contributing to the final model. Baseline daily short-acting β2-agonist (SABA) dose and evidence of GERD both strongly influenced the propensity score (see Table S1 for correlation coefficients).

Both algorithms used to match on the propensity score (Research in Real-Life [RiRL] matching algorithm and greedy algorithm) retained similar numbers of patients (2642 and 2646, respectively). In the PSM dataset produced using the RiRL algorithm, there were no significant differences between cohorts in baseline variables at the 5% level. A trend (P<0.10) was recorded for shorter duration of asthma (P=0.083), lower mean daily SABA doses (P=0.096 on the ratio scale), and less short-acting muscarinic antagonist use (P=0.056) in the ciclesonide cohort as compared with the fine-particle inhaled corticosteroid cohort. Using the greedy algorithm, there was one significant difference between cohorts at the 5% level (higher incidence of GERD in the ciclesonide cohort; P=0.022) and a trend (P=0.094) for shorter duration of asthma in the ciclesonide cohort.

Adjustment for residual confounders after matching made only modest differences except in the analysis of overall asthma control, suggesting that the matching was, generally, effective in reducing confounding. Unadjusted and adjusted odds ratios (ORs) for risk-domain asthma control were lower than the unmatched, unadjusted OR, whereas ORs were higher using all other methods, which likely reflects the sample of patients selected. Furthermore, the adjusted ORs for risk-domain and overall asthma control were lower than the unadjusted ORs when using PSM with the RiRL algorithm, whereas adjustments to the model in all other analysis methods increased the ORs. This apparent anomaly was driven by the magnitude of the residual baseline difference in evidence of GERD between cohorts (negligible difference using the RiRL algorithm, significant differences in other datasets), further confirming that models and results were sensitive to the sample selected, including both the absolute and relative severity of the patients and residual baseline and standardized differences between cohorts in key variables.

Inverse probability of treatment weighting

Stabilized weights – which multiply the IPTW by the unconditional probability of treatment allocation – were used to create a pseudo-dataset with a sample size of 4063, thereby nearly preserving the sample size of the original data. There were statistically significant differences between cohorts in mean daily SABA doses and prescriptions for allergies when measured on the ratio/interval scale, but cohorts were balanced when these variables were categorized; there were no other significant differences between treatment arms at the 5% level (see the main paper). There was a trend (P=0.062) for a different distribution of severe exacerbations across treatment arms, with greater proportions of patients in the ciclesonide cohort in the 0 and ≥2 categories. The largest standardized differences were seen in baseline exacerbations (categorized) (0.08) and age group (0.09). Standardized differences were within the −0.1 to 0.1 range for the weighted pseudo-dataset, including the two variables where there remained a statistically significant difference at baseline; however, as Austin [2] notes, statistical significance is not the recommended method to assess balance, and the standardized differences confirmed that an acceptable balance was achieved.
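The effect of stabilization on the size of the pseudo-dataset can be seen in a small sketch. It assumes the usual definition of stabilized weights, in which each patient's inverse probability weight is multiplied by the marginal probability of the treatment that patient actually received. The function name and the six-patient example are hypothetical and are not study data.

```python
def stabilized_weights(treatment, propensity):
    """Stabilized inverse probability of treatment weights.

    treatment: list of 0/1 indicators (1 = treated).
    propensity: estimated probability of treatment for each patient.
    """
    n = len(treatment)
    p_treated = sum(treatment) / n  # unconditional probability of treatment
    weights = []
    for t, ps in zip(treatment, propensity):
        if t == 1:
            weights.append(p_treated / ps)
        else:
            weights.append((1 - p_treated) / (1 - ps))
    return weights


# Hypothetical data: two treated and four control patients.
treatment = [1, 1, 0, 0, 0, 0]
propensity = [0.5, 0.25, 0.25, 0.5, 0.25, 0.25]

weights = stabilized_weights(treatment, propensity)
# Unstabilized weights for comparison: 1/ps if treated, 1/(1 - ps) otherwise.
unstabilized = [1 / ps if t == 1 else 1 / (1 - ps)
                for t, ps in zip(treatment, propensity)]

print(sum(weights), sum(unstabilized))
```

In this example the stabilized weights sum to about 6, the original sample size, whereas the unstabilized weights sum to about 12. This is the property that lets a stabilized pseudo-dataset stay close to the size of the original data.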

Adjustment for residual confounders made only minimal differences to the exacerbation and risk-domain asthma control endpoints, and no difference to the change in therapy endpoint, suggesting that the weighting was effective in reducing confounding. Adjusting for residual confounders increased the OR for overall asthma control, driven mainly by the residual difference in SABA use between cohorts. Overall, this method seemed effective in estimating the average treatment effect using the full power of the original dataset without selection bias. By using the stabilized weightings, treatment effects and their variances could be estimated simply using conventional modeling methods, with adjustments for any residual confounding.

Covariate adjustment using the propensity score

Covariate adjustment using the propensity score gave results consistent with other methods for all endpoints. Further adjustment was limited (outcome exacerbation rate was additionally adjusted for baseline exacerbations; outcome risk-domain asthma control and overall asthma control were additionally adjusted for baseline risk-domain asthma control status), because other candidate adjustment variables (baseline SABA use, evidence of GERD) correlated strongly with the propensity score, leading to collinearity in the model. As an exercise, we adjusted endpoint models for component baseline confounders rather than the propensity score and compared the results. For the exacerbation and risk-domain asthma control endpoints, there was very little difference in results between adjusting for the propensity score (plus baseline exacerbation count or risk-domain asthma control status, respectively) and adjusting for a full covariate list, but the propensity score adjustment made the models more parsimonious and simpler to reduce. There was more variation in results for the overall asthma control endpoint, but the fully adjusted model was very sensitive to adjustments and, again, propensity score adjustment provided a simple and parsimonious option. When many potential confounders are involved, many of which may be collinear, the final model choice can be subjective. Thus, provided the propensity score is correctly specified, adjusting for the propensity score provides a simple, effective, and repeatable method to account for differences between treatment arms.

Interestingly, the propensity score was not a significant covariate in the model for therapy change; adjustments to the treatment effect could instead be made by adjusting directly for evidence of GERD and rhinitis, which seemed the preferable option for this endpoint. The nonsignificance of the propensity score in this model, and the general stability of the therapy change results across all analysis methods, suggest that this endpoint was quite robust and not greatly influenced by treatment bias.

Stratification by propensity score

This method would not be an appropriate choice in a study where the primary endpoint is exacerbation rate. A negative binomial model cannot be stratified (using PROC GENMOD), and, whereas a Poisson model can be stratified, the unadjusted, stratified model took several hours to run, and there was insufficient memory to stratify and additionally adjust, even for one additional variable. Furthermore, the unadjusted stratified model gave identical results to the unmatched, unadjusted model. Stratification was possible and practical for the dichotomous endpoints.
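For a dichotomous endpoint, stratification by propensity score can be sketched as follows. The number of strata and the way stratum-specific estimates were combined are not stated here, so the rank-based strata (quintiles by default) and the Mantel-Haenszel pooled odds ratio below are assumptions chosen for illustration; they are not a description of the stratified models fitted in the study. All names and data are hypothetical.

```python
def assign_strata(scores, n_strata=5):
    """Assign each patient to a rank-based stratum of near-equal size.

    Returns a list of stratum indices (0 .. n_strata - 1), in the same
    order as the input propensity scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata


def mantel_haenszel_or(treatment, outcome, strata):
    """Mantel-Haenszel pooled odds ratio across strata.

    treatment, outcome: lists of 0/1 indicators; strata: stratum labels.
    Assumes at least one stratum has a nonzero b*c term.
    """
    num = den = 0.0
    for s in set(strata):
        idx = [i for i, v in enumerate(strata) if v == s]
        n = len(idx)
        a = sum(1 for i in idx if treatment[i] == 1 and outcome[i] == 1)
        b = sum(1 for i in idx if treatment[i] == 1 and outcome[i] == 0)
        c = sum(1 for i in idx if treatment[i] == 0 and outcome[i] == 1)
        d = sum(1 for i in idx if treatment[i] == 0 and outcome[i] == 0)
        num += a * d / n
        den += b * c / n
    return num / den


# Hypothetical example: two strata of eight patients with identical 2x2 tables.
treatment = [1, 1, 1, 1, 0, 0, 0, 0] * 2
outcome = [1, 1, 1, 0, 1, 1, 0, 0] * 2
strata = [0] * 8 + [1] * 8
pooled_or = mantel_haenszel_or(treatment, outcome, strata)
print(pooled_or)
```

In practice the strata would come from `assign_strata` applied to the estimated propensity scores, and balance within strata would still need to be checked.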

Steering committee

The following independent steering committee agreed the study design and methods of the current study, before seeking approval from the governance board of the PHARMO database (detailed in the “Ethical approval” section of the main article).


  1. Professor Dirkje Postma, Groningen University, The Netherlands;

  2. Thys van der Molen, Department of General Practice, University Medical Center Groningen, University of Groningen, The Netherlands;

  3. Dr Elliot Israel, Brigham and Women’s Hospital and Harvard Medical School, USA;

  4. Dr Gene Colice, George Washington University School of Medicine, Washington, DC, USA.

Table S1 Correlation coefficients between the propensity score and its components, ranked in order of absolute magnitude

Table S2 Unadjusted results for study endpoints

References

Acknowledgments

We thank R Brett McQueen and Joan B Soriano for reviewing the manuscript and offering useful feedback.

This study was funded in equal parts by Takeda Pharmaceuticals International GmbH, Zurich, Switzerland; and by Research in Real-Life Ltd, UK, under a subcontract by Observational and Pragmatic Research Institute Pte Ltd, Singapore.

The dataset supporting the conclusions of this article is not available because the data were derived from a proprietary database provided by the PHARMO Database Network.

Author contributions

AB, JMK, DvE, and DBP developed the protocol for the study. RMCH and JAO provided expertise regarding use of the PHARMO database. AB and CM conducted the analyses, and EVH developed the first draft of the manuscript. All authors were involved in the interpretation of the data and the critical review and revision of the manuscript. All authors read and approved the final manuscript.

Disclosure

AB and CM were employees of Research in Real-Life (RiRL), Cambridge, UK. Research in Real-Life was subcontracted by Observational and Pragmatic Research Institute Pte Ltd, Singapore, to conduct this study and has conducted paid research in respiratory disease on behalf of the following other organizations in the past 5 years: Aerocrine, AKL Ltd, Almirall, AstraZeneca, Boehringer Ingelheim, Chiesi, GlaxoSmithKline, Meda, Mundipharma, Napp, Novartis, Orion, Takeda, Teva, and Zentiva, a Sanofi company.

NR has received over the past 3 years: 1) fees for speaking, organizing education, participation in advisory boards or consulting from 3M, Aerocrine, Almirall, AstraZeneca, Boehringer Ingelheim, Chiesi, Cipla, GlaxoSmithKline, MSD-Chibret, Mundipharma, Novartis, Pfizer, Sanofi, Sandoz, Teva; 2) research grants from Novartis, Boehringer Ingelheim and Pfizer.

EVH is a consultant to RiRL and has received payment for writing and editorial support to Merck.

The University of Groningen has received money for DSP regarding an unrestricted educational grant for research from AstraZeneca, Chiesi. Travel to conferences for the European Respiratory Society (ERS) and/or the American Thoracic Society (ATS) has been partially funded by AstraZeneca, Chiesi, GSK, Takeda. Fees for consultancies were given to the University of Groningen by AstraZeneca, Boehringer Ingelheim, Chiesi, GSK, Takeda, and TEVA. Travel and lectures in China were paid by Chiesi.

RMCH and JAO are employees of the PHARMO Institute. This independent research institute performs financially supported studies for government and related health care authorities and several pharmaceutical companies.

DvE and JMK are employees of Takeda.

DBP has Board Membership with Aerocrine, Almirall, Amgen, AstraZeneca, Boehringer Ingelheim, Chiesi, Meda, Mundipharma, Napp, Novartis, and Teva. Consultancy: Almirall, Amgen, AstraZeneca, Boehringer Ingelheim, Chiesi, GlaxoSmithKline, Meda, Mundipharma, Napp, Novartis, Pfizer, Teva, and Zentiva; Grants/Grants Pending with UK National Health Service, British Lung Foundation, Aerocrine, AstraZeneca, Boehringer Ingelheim, Chiesi, Eli Lilly, GlaxoSmithKline, Meda, Merck, Mundipharma, Novartis, Orion, Pfizer, Respiratory Effectiveness Group, Takeda, Teva, and Zentiva; Payments for lectures/speaking: Almirall, AstraZeneca, Boehringer Ingelheim, Chiesi, Cipla, GlaxoSmithKline, Kyorin, Meda, Merck, Mundipharma, Novartis, Pfizer, SkyePharma, Takeda, and Teva; Payment for manuscript preparation: Mundipharma and Teva; Patents (planned, pending or issued): AKL Ltd.; payment for the development of educational materials: GlaxoSmithKline, Novartis; Stock/Stock options: Shares in AKL Ltd which produces phyto-pharmaceuticals and owns 80% of Research in Real-Life Ltd, 75% of the social enterprise Optimum Patient Care Ltd and 75% of Observational and Pragmatic Research Institute Pte Ltd; received payment for travel/accommodations/meeting expenses from Aerocrine, Boehringer Ingelheim, Mundipharma, Napp, Novartis, and Teva; funding for patient enrolment or completion of research: Almirall, Chiesi, Teva, and Zentiva; peer reviewer for grant committees: Medical Research Council (2014), Efficacy and Mechanism Evaluation programme (2012), HTA (2014); and received unrestricted funding for investigator-initiated studies from Aerocrine, AKL Ltd, Almirall, Boehringer Ingelheim, Chiesi, Meda, Mundipharma, Napp, Novartis, Orion, Takeda, Teva, and Zentiva. The authors report no other conflicts of interest in this work.