Matching and propensity scores

Updated 1/15

I. General background

If you are new to the topic, this book is an informative, relatively non-technical introduction:

Holmes, W.M. (2013). Using Propensity Scores in Quasi-Experimental Designs. SAGE Publications.

Another great text, which is more technical than Holmes but still quite accessible, is

Guo, S.Y. and Fraser, M.W. (2014). Propensity Score Analysis: Statistical Methods and Applications. SAGE Publications. (Be sure to get the 2nd edition; it is much better than the 1st.)

If you are considering using psmatch2 in Stata, this blog post about weights in psmatch2 should prove useful.

II. Matching and complex samples

Matching advocates tend to emphasize internal validity over external validity, yet many scholars are still concerned with the latter. Thus the question I am often asked: how do I do matching when I have survey weights that correct for oversampling and nonresponse? 

DuGoff et al. (2014). Generalizing observational study results: Applying propensity score methods to complex surveys.


Objective

To provide a tutorial for using propensity score methods with complex survey data.

Data Sources

Simulated data and the 2008 Medical Expenditure Panel Survey.

Study Design

Using simulation, we compared the following methods for estimating the treatment effect: a naïve estimate (ignoring both survey weights and propensity scores), survey weighting, propensity score methods (nearest neighbor matching, weighting, and subclassification), and propensity score methods in combination with survey weighting. Methods are compared in terms of bias and 95 percent confidence interval coverage. In Example 2, we used these methods to estimate the effect on health care spending of having a generalist versus a specialist as a usual source of care.

Principal Findings

In general, combining a propensity score method and survey weighting is necessary to achieve unbiased treatment effect estimates that are generalizable to the original survey target population.


Conclusions

Propensity score methods are an essential tool for addressing confounding in observational studies. Ignoring survey weights may lead to results that are not generalizable to the survey target population. This paper clarifies the appropriate inferences for different propensity score methods and suggests guidelines for selecting an appropriate propensity score method based on a researcher’s goal.
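
To make the "combine the two" idea concrete for the weighting variant, here is a minimal sketch in plain Python. It assumes the propensity scores have already been estimated, and it assumes that combining means multiplying each unit's propensity score weight by its survey weight before computing a weighted difference in means. The function names and the toy numbers are made up for illustration; they do not come from the paper, and a real analysis would also need design-based standard errors.

```python
def composite_weights(treated, pscore, survey_w, estimand="ATT"):
    """Multiply propensity-score weights by survey weights, unit by unit."""
    weights = []
    for t, e, s in zip(treated, pscore, survey_w):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        if estimand == "ATT":
            # Treated units keep weight 1; controls are weighted by the odds.
            ps_w = 1.0 if t else e / (1.0 - e)
        elif estimand == "ATE":
            # Inverse probability of the treatment actually received.
            ps_w = 1.0 / e if t else 1.0 / (1.0 - e)
        else:
            raise ValueError("estimand must be 'ATT' or 'ATE'")
        weights.append(ps_w * s)
    return weights


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_effect(outcome, treated, weights):
    """Weighted mean of the treated minus weighted mean of the controls."""
    y1 = [y for y, t in zip(outcome, treated) if t]
    w1 = [w for w, t in zip(weights, treated) if t]
    y0 = [y for y, t in zip(outcome, treated) if not t]
    w0 = [w for w, t in zip(weights, treated) if not t]
    return weighted_mean(y1, w1) - weighted_mean(y0, w0)


# Toy data, invented purely for illustration.
treated = [1, 1, 0, 0]
pscore = [0.8, 0.5, 0.5, 0.2]
survey_w = [1.0, 2.0, 1.0, 2.0]
outcome = [10.0, 8.0, 6.0, 4.0]

att_weights = composite_weights(treated, pscore, survey_w, estimand="ATT")
att_estimate = weighted_effect(outcome, treated, att_weights)
print(att_weights, att_estimate)
```

Dropping survey_w from the product gives the ordinary propensity score weighting estimate, which describes the sample rather than the survey's target population.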

III. Matching with non-binary treatments (dosage models)

Hirano and Imbens (2004). The propensity score with continuous treatments.

Much of the work on propensity score analysis has focused on the case where the treatment is binary. In this chapter we examine an extension to the propensity score method, in a setting with a continuous treatment. Following Rosenbaum and Rubin (1983) and most of the other literature on propensity score analysis, we make an unconfoundedness or ignorability assumption, that adjusting for differences in a set of covariates removes all biases in comparisons by treatment status. Then, building on Imbens (2000) we define a generalization of the binary treatment propensity score, which we label the generalized propensity score (GPS). We demonstrate that the GPS has many of the attractive properties of the binary treatment propensity score. Just as in the binary treatment case, adjusting for this scalar function of the covariates removes all biases associated with differences in the covariates. The GPS also has certain balancing properties that can be used to assess the adequacy of particular specifications of the score. We discuss estimation and inference in a parametric version of this procedure, although more flexible approaches are also possible. We apply this methodology to a data set collected by Imbens, Rubin, and Sacerdote (2001). The population consists of individuals winning the Megabucks lottery in Massachusetts in the mid-1980’s. We are interested in the effect of the amount of the prize on subsequent labor earnings. Although the assignment of the prize is obviously random, substantial item and unit nonresponse led to a selected sample where the amount of the prize is no longer independent of background characteristics. We estimate the average effect of the prize adjusting for differences in background characteristics using the propensity score methodology, and compare the results to conventional regression estimates. The results suggest that the propensity score methodology leads to credible estimates, that can be more robust than simple regression estimates.
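
For readers who want to see what the first step of a parametric GPS looks like, here is a rough sketch in plain Python. It assumes the treatment is normally distributed given the covariates, with a mean that is linear in a single covariate; the estimated GPS for a unit is then the normal density evaluated at that unit's observed treatment. The single covariate, the function names, and the toy numbers are simplifications of mine, not anything taken from the chapter, and the later steps (checking balance, estimating the dose-response function) are not shown.

```python
import math


def fit_treatment_model(x, t):
    """Regress treatment t on one covariate x; return intercept, slope, residual variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_t = sum(t) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxt = sum((xi - mean_x) * (ti - mean_t) for xi, ti in zip(x, t))
    slope = sxt / sxx
    intercept = mean_t - slope * mean_x
    residuals = [ti - (intercept + slope * xi) for xi, ti in zip(x, t)]
    # Maximum likelihood variance estimate (divides by n, not n - 2).
    sigma2 = sum(r * r for r in residuals) / n
    return intercept, slope, sigma2


def gps(t, x, intercept, slope, sigma2):
    """Normal density of treatment level t given covariate value x."""
    mu = intercept + slope * x
    return math.exp(-((t - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


# Toy data, invented for illustration: x is a covariate, t a continuous dose.
x = [0.0, 1.0, 2.0, 3.0]
t = [1.0, 2.5, 2.5, 4.0]

b0, b1, s2 = fit_treatment_model(x, t)
# Estimated GPS for each unit, evaluated at the dose it actually received.
gps_hat = [gps(ti, xi, b0, b1, s2) for xi, ti in zip(x, t)]
print(b0, b1, s2, gps_hat)
```

The same gps function can be evaluated at any hypothetical dose t for a given x, which is what a dose-response estimate needs later on.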

Luckily, the Hirano and Imbens approach has been turned into a Stata command, making the estimation of dose-response models a relatively simple task:

Bia and Mattei (2008). A Stata package for the estimation of the dose–response function through adjustment for the generalized propensity score.

In this article, we briefly review the role of the propensity score in estimating dose–response functions as described in Hirano and Imbens (2004, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, 73–84). Then we present a set of Stata programs that estimate the propensity score in a setting with a continuous treatment, test the balancing property of the generalized propensity score, and estimate the dose–response function. We illustrate these programs by using a dataset collected by Imbens, Rubin, and Sacerdote (2001, American Economic Review 91: 778–794).

IV. Matching with grouped data

This is a tricky area. If group-level (contextual) variables drive both treatment and outcome, you face a trade-off: matching units within groups controls for unobserved group differences but risks poor matches, while matching across groups yields closer matches but leaves group-level confounders uncontrolled.

Rickles and Seltzer (2014). A Two-Stage Propensity Score Matching Strategy for Treatment Effect Estimation in a Multisite Observational Study.

When nonrandom treatments occur across sites, within-site matching (WM) is often desirable. This approach, however, can significantly reduce treatment group sample size and exclude substantively important subgroups. To limit these drawbacks, we extend a matching approach developed by Stuart and Rubin to a multisite study. We demonstrate the proposed method through a multisite analysis of algebra enrollment effects in 50 middle schools, where within each school students are assigned to algebra or pre-algebra, and test the utility of the proposed method with a simulation study. The results document the method’s conceptual appeal and indicate that two-stage matching is a viable alternative to strict WM or matching that ignores the nested data structure (pooled matching).
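
To illustrate the within-site versus pooled trade-off, here is a deliberately simplified sketch of a two-stage ordering in plain Python: try to match each treated unit to a control in its own site first, and only fall back to controls from other sites when no within-site control lies inside the caliper. This is not the authors' procedure. It uses greedy one-to-one matching on an already estimated propensity score, makes no adjustment for between-site differences in the second stage, and every name and number in it is hypothetical.

```python
def nearest_control(score, pool, used, caliper):
    """Return the id of the closest unused control within the caliper, or None."""
    best_id, best_dist = None, None
    for control_id, control_score in pool:
        if control_id in used:
            continue
        dist = abs(score - control_score)
        if dist <= caliper and (best_dist is None or dist < best_dist):
            best_id, best_dist = control_id, dist
    return best_id


def two_stage_match(units, caliper=0.1):
    """units: dicts with keys 'id', 'site', 'treated', 'pscore'."""
    treated = [u for u in units if u["treated"]]
    controls = [u for u in units if not u["treated"]]
    used, matches, leftover = set(), {}, []

    # Stage 1: match within the treated unit's own site.
    for u in treated:
        pool = [(c["id"], c["pscore"]) for c in controls if c["site"] == u["site"]]
        m = nearest_control(u["pscore"], pool, used, caliper)
        if m is None:
            leftover.append(u)
        else:
            used.add(m)
            matches[u["id"]] = (m, "within")

    # Stage 2: for treated units still unmatched, search controls in all sites.
    for u in leftover:
        pool = [(c["id"], c["pscore"]) for c in controls]
        m = nearest_control(u["pscore"], pool, used, caliper)
        if m is not None:
            used.add(m)
            matches[u["id"]] = (m, "pooled")
    return matches


# Toy data, invented for illustration.
units = [
    {"id": "t1", "site": "A", "treated": True, "pscore": 0.60},
    {"id": "c1", "site": "A", "treated": False, "pscore": 0.62},
    {"id": "t2", "site": "A", "treated": True, "pscore": 0.30},
    {"id": "c2", "site": "B", "treated": False, "pscore": 0.31},
    {"id": "t3", "site": "B", "treated": True, "pscore": 0.90},
    {"id": "c3", "site": "B", "treated": False, "pscore": 0.50},
]
matches = two_stage_match(units, caliper=0.1)
print(matches)
```

In the toy data, one treated unit finds a control in its own site, one has to borrow a control from another site, and one stays unmatched. The labels returned with each match make it easy to see how much of the matched sample relies on cross-site comparisons.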