Research Article

Choosing propensity score matching over regression adjustment for causal inference: when, why and how it makes sense

, PhD
Pages 379-391 | Accepted 19 Aug 2007, Published online: 28 Oct 2008

Summary

This study identified when regression adjustment fails to adjust adequately for differences in observed covariates and where propensity score matching is the only alternative.

Multivariate analysis might fail to adjust for observed confounders if:

  • 1. The means of the propensity scores in the two groups are more than one-half a standard deviation apart unless distributions of the covariates in both groups are nearly symmetric, sample sizes of the two groups are approximately the same and distributions of the covariates in the two groups have similar variances;

  • 2. The ratio of the propensity score variances in the two groups is significantly different from one;

  • 3. The ratio of residual variances in the two groups after adjusting for the propensity score is significantly different from one.

A retrospective analysis showed that the treatment effect would be an estimated $305 (or 26%) less if the misspecified outcome model had been chosen.

Introduction

Causal inference is challenging in all non-experimental studies because of the possibility of overt and hidden biasCitation1. When evaluating certain treatment programmes, overt bias can exist because the treatment and control groups are different in terms of certain observable factors such as age, gender and co-morbidities. Hidden bias may exist as a result of failing to control for unobservable factors, such as a doctor's prescribing patterns.

Propensity score matching and regression analysis are two statistical techniques used to remove overt bias. The treatment group generally differs from the control group in demographic, socioeconomic and clinical factors. To isolate the treatment effect, control of the observable confounders is necessary. Several published articles (of varying levels of sophistication) have sought to explain the theoretical background of the propensity score matching method and its applicationsCitation2–7.

Propensity score methods are increasingly used in the medical literature. A systematic literature search by Stürmer et al found that the annual number of publications using propensity score methods increased from 8 in 1998 to 71 in 2003Citation8. Last year, the number of publications using propensity score methods was 171.

Proponents of the method outline several advantages of propensity score matching over regression analysis. First, propensity score methodology allows observational studies to be designed in a way analogous to randomised clinical trials: without involving outcome variablesCitation4. Regression analysis uses the outcome as a left-hand-side variable, which is not supposed to be available during randomisation. Second, confounding by treatment variables is often the main challenge to validity, and the propensity score focuses directly on the treatment variableCitation9. Third, a matched analysis can eliminate non-comparable exposed subjectsCitation10. Finally, propensity score matching provides robust estimates when one has relatively few outcomes compared with the number of potential confoundersCitation11–13. Therefore, clinical researchers who work with retrospective data are inclined to use the propensity score method.

Matching has long been discussed in the statistical literatureCitation3,Citation14–17. However, as with many other techniques borrowed from one discipline and applied to another, there is a tendency to use propensity score matching blindly in the field of health services research. There have been several reviews of the application of propensity scores in the medical literatureCitation8,Citation18–22. For example, Weitzen et al reviewed 47 observational studies found in Medline and the Science Citation Index addressing clinical questions by using propensity score matchingCitation18. Of the 47 studies, 22 did not specify whether the propensity score created a balance between the exposure groups on the characteristics considered in the propensity model. In addition, 24 studies did not indicate what method was used to select variables, 30 were unclear about whether interaction terms were incorporated into the propensity score and 39 did not even consider goodness of fit. Until recently, there was no method to choose among different types of matching techniquesCitation1.

Proponents of regression analysis, however, focus on statistical advances in the regression literatureCitation23–25. Deciding on a matching algorithm and obtaining standard errors for inference through matching is more complicated than conducting regression analysis. Little of the existing literature examines this process. Thus, careful regression with good controls and flexible functional forms is often the best that can be done.

However, when covariates show real differences among groups, as summarised by RubinCitation7, covariance adjustments involve some degree of extrapolation. To illustrate, consider a hypothetical comparison of teaching and non-teaching hospitals that adjusts for differences in the volume of certain surgical procedures. The teaching hospitals range from 100 to 120 procedures per month and the non-teaching hospitals range from 40 to 60. Covariance adjustment for patient volume would adjust results so that they ostensibly apply to a mean volume of 80 in each group, even though neither group's volume is at or near this level.
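The extrapolation in the hospital example can be made concrete with a few lines of Python. This is only an illustrative sketch: the individual volumes are hypothetical values chosen to span the ranges quoted in the text, and `within_range` is a helper introduced here for illustration.

```python
from statistics import mean

# Hypothetical monthly procedure volumes, chosen only to span the ranges in the text.
teaching = [100, 105, 110, 115, 120]
non_teaching = [40, 45, 50, 55, 60]


def within_range(value, group):
    """Return True if value lies inside the observed range of group."""
    return min(group) <= value <= max(group)


# Covariance adjustment compares the groups at a common covariate value,
# here the overall mean volume across both groups.
overall_mean = mean(teaching + non_teaching)

print(overall_mean)                              # 80
print(within_range(overall_mean, teaching))      # False
print(within_range(overall_mean, non_teaching))  # False
```

The adjusted comparison is therefore made at a volume that no hospital in either group actually has, which is the extrapolation the text warns about.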

Coherent guidelines for choosing between propensity score matching and regression analysis are missing in the literature. In this article, both methods are evaluated and three quantifiable criteria are proposed for consideration before selecting the appropriate one. If certain criteria are satisfied, propensity score matching surpasses regression analysis not only in its design but also in its statistical properties. This paper demonstrates the application of these criteria to Medstat MarketScan data (Thomson-Medstat, Ann Arbor, MI), followed by an examination of treatment effects calculated both by propensity score matching and regression analysis, along with concluding thoughts.

Study design and methods

For many years, both the statistical and the econometric literature have warned that regression adjustment cannot be trusted to remove all the bias if the original covariate distributions diverge widely. It has been shown that multivariate analysis might fail to adjust for observed confounder differences if any of the following criteria are metCitation4.

  • 1. The means of the propensity scores in the two groups are more than one-half a standard deviation (sd) apart, unless:

    • a. distributions of the covariates in both groups are nearly symmetric. One can use the D'Agostino test with the empirical correction developed by RoystonCitation26,Citation27. The test determines whether the computed values of skewness depart from the expected value of zero. Other tests are also detailed in the literature and implemented in both Stata and SAS software;

    • b. sample sizes of the two groups are approximately the same. One simulation study suggests that if one group is <5% of the other group, the effect of sample size differences on the estimators is significantCitation28; and

    • c. distributions of the covariates in the two groups have similar variances. Variance comparison tests described by HoelCitation29 or Pagano and GauvreauCitation30 can be used to compare the variances in two groups. The tests are based on the fact that the ratio of the squared sd of the first group to that of the second follows an F distribution with n1–1 and n2–1 degrees of freedom, where n1 and n2 are the sample sizes for the first and second group. This test is also implemented in both Stata and SAS software.

  • 2. The ratio of the propensity score variances in the two groups is significantly different from one (e.g. one-half or two are far too extreme).

  • 3. The ratio of residual variances in the two groups after adjusting for the propensity score is significantly different from one (e.g. one-half or two are far too extreme).
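As a rough illustration of how Criteria 1 and 2 might be screened in practice, the sketch below computes the gap between the mean propensity scores in units of a standard deviation and the ratio of the propensity score variances. Several choices here are assumptions rather than part of the criteria as stated: the pooled sd is used as the yardstick for Criterion 1, the fixed bounds of one-half and two stand in for a formal significance test in Criterion 2, and the function names and example scores are hypothetical.

```python
import math
from statistics import mean, variance


def propensity_balance(ps_treated, ps_control):
    """Return (mean gap in sd units, variance ratio) for two sets of propensity scores."""
    var_t, var_c = variance(ps_treated), variance(ps_control)
    # Assumption: the "standard deviation" is taken as the pooled sd of the two groups.
    pooled_sd = math.sqrt((var_t + var_c) / 2)
    mean_gap = abs(mean(ps_treated) - mean(ps_control)) / pooled_sd
    return mean_gap, var_t / var_c


def flag_criteria(ps_treated, ps_control):
    """Flag Criterion 1 (means > 0.5 sd apart) and a crude version of Criterion 2."""
    mean_gap, ratio = propensity_balance(ps_treated, ps_control)
    return {
        "criterion_1": mean_gap > 0.5,
        # Crude screen only: the criterion itself calls for a significance test.
        "criterion_2": ratio <= 0.5 or ratio >= 2.0,
    }


# Hypothetical propensity scores for illustration.
treated = [0.55, 0.60, 0.70, 0.75]
control = [0.20, 0.25, 0.30, 0.35, 0.40]
print(propensity_balance(treated, control))
print(flag_criteria(treated, control))
```

A flag on Criterion 1 would still need to be weighed against subcriteria a to c before concluding that regression adjustment is unreliable.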

If the means and the variances of the propensity scores of the two groups are not close to each other, the divergence in the covariates is too wide to draw any reasonable conclusions.

In particular, it has been shown that linear regression on random samples with widely diverging covariates can reduce the bias by more than 100% (overcorrecting) or by less than 0% (increasing the original bias)Citation7.

All of these conditions implicitly assume normally distributed covariates. The criteria for reliability of propensity score matching with non-normal covariates can be more complex. For example, as pointed out by Rubin, one obvious condition with non-normal covariates is that propensity score matching requires strong overlap of the distributions of the propensity scores in the two groupsCitation16. BaserCitation31 showed the application of estimators proposed by Crump et alCitation32, which account for lack of overlap under propensity score matching, in a case study involving the analysis of health expenditure data for the US.

The criteria are valid for any regression methods that rely on linear additive effects in the covariatesCitation7. For example, commonly used models (ordinary least squares, logit, probit, log-linear regression and generalised linear models) to estimate the outcome of interest fall into this categoryCitation24.

Data source and sample

To illustrate the implications of using propensity score matching versus regression analysis, data from the MarketScan Commercial Claims and Encounters Database for 2002–2004 were examined to determine the effect of triptan treatment on migraine patients. This database contains the healthcare experience of more than 13 million individuals annually. The individuals' healthcare was provided in a variety of fee-for-service, fully capitated and partially capitated health plans, including preferred provider organisations, point of service plans, indemnity plans and health maintenance organisations. The variables in the database included age, gender, ICD-9 codes, plan type and geographic region.

Individuals were selected for the study sample if they had a confirmatory diagnosis of migraine (ICD-9-CM 346.XX) or menstrual migraine (ICD-9-CM 625.4) during the period of the 1st January 2002 through to the 31st December 2003. Additional inclusion criteria for the triptan cohort were:

  • at least one prescription for a triptan during the study period of the 1st January 2002 through to the 31st December 2003;

  • at least 6 months of continuous enrolment preceding the first triptan prescription (index event);

  • at least 12 months of continuous enrolment following the index event;

  • benefit eligible during the 18-month study period.

Additional inclusion criteria for the non-triptan cohort were:

  • continuous enrolment for at least 18 months during the period of the 1st January 2002 through to the 31st December 2004;

  • benefit eligible during the 18-month study period.

Triptan-using migraineurs on additional medications such as tricyclic antidepressants, anti-epileptics or beta-blockers for the treatment of their migraines were excluded from the triptan cohort. Figure 1 provides an overview of the study timeline, years of data and pre- and post-index periods.

Figure 1. Overview of the study timeline, years of data, and pre- and post-index periods.


The sociodemographic characteristics included age of household, percentage of female patients, urban residency and geographic region (northeast, north-central, south, west and other) and were measured at baseline. The percentage of patients whose healthcare was provided under capitated plans was included and measured at baseline. The Charlson Comorbidity Index was generated to capture the level and burden of co-morbidity and measured during the 6 months prior to the index date. The number of diagnostic codes measured during the pre-period term was included as a proxy for severity of illness. The year of diagnosis was included to control for the associated fixed factors. Total expenditures were calculated as the sum of inpatient, outpatient and pharmaceutical expenditures for all medical care services for the period of the 1st January 2002 through to the 31st December 2004. This included all services paid for by the healthcare insurance provider as well as co-payments and deductibles paid out-of-pocket by the individuals.

Results

The objective of this study was to decide how to estimate treatment effect (use of triptan) on migraine patients after controlling for observable characteristics between triptan and non-triptan patients. Should propensity score matching or regression analysis be used?

Table 1 shows that, compared with non-triptan patients, those using triptan were older, more likely to be female and sicker. Unadjusted total expenditures for triptan users and non-triptan users were $7,495 and $6,009, respectively. The difference (treatment effect) of $1,486 was confounded with the factors presented in Table 1.

Table 1. Summary values for triptan versus non-triptan groups.

To isolate the treatment effect from the observable confounders, adjusted total expenditures for each group should be estimated. Multivariate analysis or propensity score matching can be the method of choice. The criteria described in the previous section were applied to determine whether multivariate analysis failed to adjust for observed differences.

Propensity scores were estimated with a logit model using the variables in Table 1 (except total expenditure). Logit is the most commonly used model to estimate the propensity score, although other approaches are available, including discriminant function analysis, classification and regression trees, or neural networksCitation9. In this estimation, several interaction terms were added into the models. An F-test on these interaction terms yielded insignificant results. Overall coefficients were significant and the area under the receiver operating characteristic curve was calculated to be 0.83. Estimated propensity scores are presented in Table 2. The means of the propensity scores in the two groups were more than one-half a sd apart (Criterion 1).

Table 2. Analysis of propensity score.
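The propensity score estimation and the area under the receiver operating characteristic curve can be sketched in plain Python. This is a toy version under stated assumptions, not the study's model: it uses a single hypothetical covariate with no interaction terms, fits the logit by Newton-Raphson, and the data and function names (`fit_logit`, `propensity`, `auc`) are invented for illustration.

```python
import math


def fit_logit(x, t, iterations=25):
    """Fit P(t=1 | x) = 1 / (1 + exp(-(a + b*x))) by Newton-Raphson (one covariate)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b).
        g0 = sum(ti - pi for ti, pi in zip(t, p))
        g1 = sum((ti - pi) * xi for ti, pi, xi in zip(t, p, x))
        # Entries of the 2x2 information matrix X'WX, with W = p(1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def propensity(x, a, b):
    """Estimated probability of treatment for each covariate value."""
    return [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]


def auc(scores, t):
    """Area under the ROC curve: share of treated/control pairs ranked correctly."""
    treated = [s for s, ti in zip(scores, t) if ti == 1]
    control = [s for s, ti in zip(scores, t) if ti == 0]
    wins = sum(1.0 if st > sc else 0.5 if st == sc else 0.0
               for st in treated for sc in control)
    return wins / (len(treated) * len(control))


# Hypothetical covariate (e.g. a severity score) and treatment indicator.
x = [1, 2, 3, 4, 5, 6, 7, 8]
t = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_logit(x, t)
scores = propensity(x, a, b)
print(round(auc(scores, t), 4))  # 0.8125
```

With real claims data one would normally rely on a statistical package for the logit fit; the hand-rolled solver is used here only to keep the sketch self-contained.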

Therefore, the subcriteria were checked to see whether Criterion 1 could be waived. The D'Agostino test rejected the null hypothesis that the distributions of the covariates were symmetric: according to Table 3, skewness was significantly different from zero for each variable (Criterion 1a) (p<0.001). Sample sizes were not significantly different; the control group was only 2.6 times larger than the treatment group (Criterion 1b). According to the test described by Hoel, the distributions of the covariates diagnosis year, age, gender and number of diagnostic codes had significantly different variances (Criterion 1c).

Table 3. Skewness–kurtosis and variance of each covariate in both groups.

According to Table 2, the ratio of the propensity score variances in the two groups was statistically different from one (Criterion 2) (p<0.001).

Table 4 presents the ratios of the residual variances after adjusting for the propensity score, which are the measures for Criterion 3. Each covariate was regressed on the estimated propensity score and the residuals of this regression were taken. Bootstrapping using the appropriate techniques demonstrated in Baser et al showed that, with the exception of north-central, south and west, the ratios were significantly different from oneCitation33.

Table 4. Residual analysis.
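The residual variance ratio used for Criterion 3 can be sketched as follows. The details are assumptions made for illustration: the covariate is regressed on the propensity score by simple least squares over the pooled sample, the ratio is taken as treated over control, and the data are hypothetical. The bootstrap test of whether the ratio differs from one is not reproduced here.

```python
from statistics import mean, variance


def residual_variance_ratio(cov_t, ps_t, cov_c, ps_c):
    """Ratio (treated / control) of residual variances after regressing a
    covariate on the propensity score in the pooled sample."""
    x = list(ps_t) + list(ps_c)
    y = list(cov_t) + list(cov_c)
    mx, my = mean(x), mean(y)
    # Simple least-squares slope and intercept of covariate on propensity score.
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    res_t = [yi - (intercept + slope * xi) for xi, yi in zip(ps_t, cov_t)]
    res_c = [yi - (intercept + slope * xi) for xi, yi in zip(ps_c, cov_c)]
    return variance(res_t) / variance(res_c)


# Hypothetical data: a covariate (e.g. age) and propensity scores per group.
ps_treated = [0.5, 0.6, 0.7, 0.8]
age_treated = [40, 60, 35, 70]
ps_control = [0.1, 0.2, 0.3, 0.4]
age_control = [30, 32, 34, 36]

ratio = residual_variance_ratio(age_treated, ps_treated, age_control, ps_control)
print(ratio > 2.0)  # True: a ratio this far from one would raise a concern
```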

Overall, the proposed criteria suggested that multivariate analysis would not adjust adequately for differences in observed characteristics. Following the guidelines proposed by Baser, nearest-neighbour propensity score matching was applied with a caliper defined as 1/4 of the sd of the estimated propensity scoreCitation1. Table 5 presents the characteristics of the triptan and non-triptan groups after propensity score matching. Differences in measured characteristics between treated and untreated patients were assessed by two methods. First, a χ² test was used to compare the dichotomous risk factors and a t-test to compare continuous variables. Second, the standardised difference was computed for each covariate. The standardised difference is defined as:

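$$d = \frac{100\,(\bar{x}_{treatment} - \bar{x}_{control})}{\sqrt{(s^2_{treatment} + s^2_{control})/2}}$$

for continuous variables; and

$$d = \frac{100\,(\hat{p}_{treatment} - \hat{p}_{control})}{\sqrt{[\hat{p}_{treatment}(1 - \hat{p}_{treatment}) + \hat{p}_{control}(1 - \hat{p}_{control})]/2}}$$

for dichotomous variables, where $\bar{x}$ and $s^2$ denote the sample mean and variance of the covariate and $\hat{p}$ the sample proportion in each group.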

Table 5. Descriptive table for treatment and control cohorts after matching.

It has been suggested that a standardised difference of <10% represents meaningful balance in a given covariate between the two groupsCitation34. All of the matched covariates fall within this range.
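A minimal sketch of the balance check follows. It assumes the commonly used definition of the standardised difference, namely 100 times the difference in means (or proportions) divided by the square root of the average of the two group variances; the function names and the example values are hypothetical.

```python
import math
from statistics import mean, variance


def std_diff_continuous(x_treated, x_control):
    """Standardised difference (in %) for a continuous covariate."""
    pooled = math.sqrt((variance(x_treated) + variance(x_control)) / 2)
    return 100 * (mean(x_treated) - mean(x_control)) / pooled


def std_diff_binary(p_treated, p_control):
    """Standardised difference (in %) for a dichotomous covariate, given proportions."""
    pooled = math.sqrt((p_treated * (1 - p_treated) + p_control * (1 - p_control)) / 2)
    return 100 * (p_treated - p_control) / pooled


def is_balanced(d, threshold=10.0):
    """Apply the <10% rule of thumb quoted in the text."""
    return abs(d) < threshold


# Hypothetical matched-sample values for illustration.
print(round(std_diff_binary(0.6, 0.4), 2))        # 40.82, clearly unbalanced
print(is_balanced(std_diff_binary(0.51, 0.50)))   # True
print(is_balanced(std_diff_continuous([41, 45, 52, 38], [40, 46, 51, 39])))  # True
```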

For the purpose of comparison, regression adjustment was also applied to estimate the treatment effect. Following Manning and Mullahy, the data suggest that if regression adjustment was tried, the generalised linear model with log link and gamma family would be most appropriateCitation35.

Table 6 shows the treatment effects according to the unadjusted values and the values adjusted by propensity score matching and by regression analysis. According to propensity score matching, the treatment effect was $1,178 and significant. Regression adjustment would estimate this effect at $873. Therefore, the treatment effect would be estimated at $305 (or 26%) less if the misspecified outcome model had been chosen.

Table 6. Outcome measures.
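The arithmetic behind the reported figures can be checked directly from the dollar values quoted in the text. Note that the 26% is relative to the propensity score matching estimate:

```latex
\begin{align*}
\text{unadjusted difference} &= 7495 - 6009 = 1486 \\
\text{matching estimate} - \text{regression estimate} &= 1178 - 873 = 305 \\
\frac{305}{1178} &\approx 0.259 \approx 26\%
\end{align*}
```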

Discussion

A broad surge of interest in fundamental approaches to drawing causal inference from observational data has emerged during the last decade. Propensity score matching and regression adjustment are two popular approaches in health services research. Once an investigator has decided to estimate a treatment effect, he or she must consider which of these two approaches to use. Recent systematic literature reviews compared estimates of exposure–outcome associations obtained from propensity score methods with those obtained from multivariate models. Among 78 exposure–outcome associations in 43 studies evaluated both by propensity scores and regression models, Shah et al found that statistical significance differed between the two methods in only 10% of casesCitation19. Stürmer et al showed that the estimates from the propensity score and regression model approaches differed by >20% in only 13% of casesCitation8. However, none of the previous studies explains under what conditions significant differences are more likely, and in this respect this study fills a void in the current literature. The answer depends on the data at hand. If the covariate distributions in the two groups do not diverge widely, one might want to use regression adjustment to estimate the average treatment effect. However, if the covariate distributions diverge widely and regression adjustment fails to remove all the bias, then propensity score matching is the better alternative.

Once a researcher decides to use propensity score matching, a matching technique must be selected. Asymptotically, all matching estimators compare only exact matches and therefore provide the same answers. In a finite sample, however, the type of matching technique selected matters. Each matching technique trades off bias and efficiency. For example, matching with replacement increases the average quality of matching, so bias decreases but variance increases. Baser provided a table to compare trade-offs in terms of bias and efficiency among six different matching techniquesCitation1, and also provided several measurable criteria to choose among different types of matching techniquesCitation1.
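To make one such technique concrete, the sketch below implements nearest-neighbour matching with a caliper of one-quarter of the sd of the propensity score, the rule applied in the Results. It is a simplified illustration under stated assumptions: matching is greedy, one-to-one and without replacement, treated subjects are processed in the order given, the sd is computed over the pooled sample, and the scores are hypothetical.

```python
from statistics import stdev


def caliper_match(ps_treated, ps_control, caliper_fraction=0.25):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Returns a list of (treated_index, control_index) pairs whose propensity
    scores differ by no more than caliper_fraction * sd of all scores.
    """
    # Assumption: the sd is taken over the pooled treated and control scores.
    caliper = caliper_fraction * stdev(list(ps_treated) + list(ps_control))
    available = set(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        # Closest unused control; sorting makes tie-breaking deterministic.
        j = min(sorted(available), key=lambda k: abs(ps_control[k] - p))
        if abs(ps_control[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Hypothetical propensity scores.
ps_treated = [0.30, 0.60, 0.95]
ps_control = [0.10, 0.32, 0.58, 0.61]
print(caliper_match(ps_treated, ps_control))  # [(0, 1), (1, 3)]
```

The third treated subject has no control within the caliper and is dropped, which shows how a caliper removes non-comparable subjects at the cost of a smaller matched sample.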

Investigators analysing retrospective data should consider the sources of bias that may affect their results. Although both regression adjustment and propensity score analysis can be used to control observable bias, each one requires certain assumptions. Sensitivity analysis of these two methods is important because neither is superior to the other a priori.

In the current study, the proposed strategy is evaluated in a single empirical example where the true association parameter is not known. Ideally, one would perform simulation studies or an empirical study where an informed guess about the ‘truth’ can be made and compared with the differences between the estimates and the true parameter. For example, Cook and Goldman compared estimates based on propensity scores and regression models using simulation and found that the propensity scores method displayed greater robustness in settings with high correlation between exposure and confoundersCitation36. In a recent article, Stukel et al compared several analytical methods for removing the effect of selection bias in observational studies where the results from randomised controlled trials are availableCitation37. The use of propensity score matching estimation compared with multivariate risk adjustment models yielded closer results than was obtained from randomised controlled trials. Although these results do not show how close the estimates were to the true parameter, since the criteria derived from the literature were based on the simulation results, the authors would suggest that matching on the propensity score was optimal in this dataset.

It should be noted that neither propensity score matching nor regression adjustment addresses or resolves problems due to imbalances in unmeasured factors. When unobservable bias exists, one can use a bounding approachCitation38, a difference-in-difference estimatorCitation20 or an instrumental variable approachCitation39. However, these estimations are also confounded by their own limitations.

Conclusions

Causal inference is challenging when studying observational or quasi-randomised experimental data because of the inevitable presence of self-selection. With this study, the authors present guidelines under which propensity score matching is a better choice than regression adjustment. A case study involving the analysis of US health expenditure data has been presented to highlight how the choice between regression adjustment and propensity score matching can substantially affect the estimated magnitude of the treatment effect.

The discussion in this article does not provide a detailed or rigorous treatment of the theory that underlies multivariate analysis or propensity score matching techniques. The authors encourage curious readers to consult books by WooldridgeCitation24,Citation25 for the theory behind multivariate analysis and a series of articles by RubinCitation4–7,Citation16,Citation17 for the theory behind propensity score matching.

References

  • Baser O. Too much ado about propensity score models? Comparing methods of propensity score matching. Value in Health 2006; 9: 377–385
  • Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41
  • Cochran WG, Rubin DB. Controlling bias in observational studies: a review. Sankhya 1973; 35: 417–446
  • Rubin DB. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology 2001; 2: 169–188
  • Rubin DB. Inference and missing data. Biometrika 1976; 63: 581
  • Rubin DB. The American College of Physicians. Estimating Causal Effects from Large Data Sets Using Propensity Scores. Philadelphia, PA 1997
  • Rubin DB. William G. Cochran's contributions to the design, analysis, and evaluation of observational studies. In: Rao PSRS, Sedransk J, eds. W. G. Cochran's Impact on Statistics. John Wiley, New York, NY 1984; 37–69
  • Stürmer T, Joshi M, Glynn RJ, et al. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. Journal of Clinical Epidemiology 2006; 59: 437e1–437e24
  • Glynn RJ, Schneeweiss S, Wang PS, et al. Selective prescribing led to overestimation of the benefits of lipid-lowering drugs. Journal of Clinical Epidemiology 2006; 59: 819–828
  • Gu XS, Rosenbaum PR. Comparison of multivariate matching methods: structures, distances, and algorithms. Journal of Computational and Graphical Statistics 1993; 2: 405–420
  • Harrell FE, Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996; 15: 361–387
  • Cepeda MS. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. American Journal of Epidemiology 2003; 158: 280–287
  • King G, Zeng L. Logistic regression in rare events data. Political Analysis 2001; 9: 137.
  • Cochran WG. Analysis of covariance: its nature and uses. Biometrics 1957; 13: 261–281
  • Carpenter RG. Matching when covariables are normally distributed. Biometrika 1977; 64: 299
  • Rubin DB. Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics 1977; 2: 1–26
  • Rubin DB. Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association 1979; 74: 318–328
  • Weitzen S, Lapane KL, Toledano AY, et al. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiology and Drug Safety 2004; 13: 841–853
  • Shah BR, Laupacis A, Hux JE, et al. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. Journal of Clinical Epidemiology 2005; 58: 550–559
  • Fu AZ, Dow WH, Liu GG. Propensity score and difference-in-difference methods: a study of second-generation antidepressant use in patients with bipolar disorder. Health Services and Outcomes Research Methodology 2007; 7: 23–38
  • Austin PC, Mamdani MM. A comparison of propensity score methods: a case-study estimating the effectiveness of post-AMI statin use. Statistics in Medicine 2006; 25: 2084–2106
  • Austin PC. The performance of different propensity score methods for estimating marginal odds ratios. Statistics in Medicine 2007; 26: 3078–3094.
  • Wooldridge J. Inverse Probability Weighted Estimation for General Missing Data Problems. East Lansing. MI: Michigan State University. 2006
  • Wooldridge JM. Introductory Econometrics: a modern approach, 3rd edn. Thomson/South-Western, Mason, OH 2006
  • Wooldridge JM. Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, MA 2002
  • D'Agostino RB, Belanger A, D'Agostino RB, Jr. A suggestion for using powerful and informative tests of normality. The American Statistician 1990; 44: 316–321
  • Royston B. sg3.5: comment on sg3.4 and an improved D'Agostino test. Stata Technical Bulletin 1991; 3: 23–24
  • Peduzzi P, Concato J, Kemper E, et al. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 1996; 49: 1373–1379
  • Hoel PG. Introduction to Mathematical Statistics. J. Wiley, New York, NY 1947
  • Pagano M, Gauvreau K. Principles of Biostatistics. Duxbury Press, Pacific Grove, CA 2000
  • Baser O. Propensity score matching with limited overlap. Economics Bulletin 2007; 9: 1–8
  • Crump RK, Hotz VJ, Imbens GW, et al. Moving the Goalposts: addressing limited overlap in the estimation of average treatment effects by changing the estimand. National Bureau of Economic Research, Cambridge, MA 2006
  • Baser O, Crown WH, Pollicino C. Guidelines for selecting among different types of bootstraps. Current Medical Research and Opinion 2006; 22: 799–808
  • Normand SLT, Landrum MB, Guadagnoli E, et al. Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: a matched analysis using propensity scores. Journal of Clinical Epidemiology 2001; 54: 387–398.
  • Manning WG, Mullahy J. Estimating log models: to transform or not to transform?. Journal of Health Economics 2001; 20: 461–494.
  • Cook EF, Goldman L. Performance of tests of significance based on stratification by a multivariate confounder score or by a propensity score. Journal of Clinical Epidemiology 1989; 42: 317–324
  • Stukel TA, Fisher ES, Wennberg DE, et al. Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. Journal of the American Medical Association 2007; 297: 278
  • Rosenbaum PR. Observational Studies. Springer-Verlag, New York, NY 2002
  • Newhouse JP, McClellan M. Econometrics in outcomes research: the use of instrumental variables. Annual Review of Public Health 1998; 19: 17–34.
