I’ve been conducting quite a few case-control and propensity score matching studies lately, so I wrote some helper functions for use after the SPSS FUZZY command. These create the case-control dataset and calculate some standardized bias metrics for the continuous covariates used in the matching.
The use case here is if you have a sub-set of treated individuals, and you want to draw a comparison sample matched on certain characteristics (which can include just one propensity score and/or multiple covariates). Here is the macro to follow along, and I will provide a quick walkthrough of how it works. (There is documentation in the header for what the parameters are and what the function returns.)
So first I am going to import my macro using INSERT:
*Inserting the macro.
INSERT FILE = "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Matching_StandBias\PropBalance_Macro.sps".
Now I am going to make a fake dataset to illustrate the utility of matching. Here I have a universe of 2,000 people. There is a subset of treated individuals (165), but a person can only be selected for treatment if they are male and 28 years old or younger (each eligible person has a 30% chance of being treated).
*Create a fake dataset.
SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 2000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME OrigData.
COMPUTE Male = RV.BERNOULLI(0.7).
COMPUTE YearsOld = RV.UNIFORM(18,40).
FORMATS Male (F1.0) YearsOld (F2.0).
DO IF Male = 1 AND YearsOld <= 28.
COMPUTE Treated = RV.BERNOULLI(0.3).
ELSE.
COMPUTE Treated = 0.
END IF.
COMPUTE #OutLogit = 0.7 + 0.5*Male - 0.05*YearsOld - 0.7*Treated.
COMPUTE #OutProb = 1/(1 + EXP(-#OutLogit)).
COMPUTE Outcome = RV.BERNOULLI(#OutProb).
FREQ Treated Outcome.
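To get a feel for the size of the effect baked into the simulation, here is the outcome equation from the syntax above worked through for a male at age 23, the midpoint of the eligible 18 to 28 range (picking age 23 is my choice, just for illustration):

$$
\begin{aligned}
\text{logit}_{\text{untreated}} &= 0.7 + 0.5(1) - 0.05(23) - 0.7(0) = 0.05, & p &= \frac{1}{1 + e^{-0.05}} \approx 0.51 \\
\text{logit}_{\text{treated}} &= 0.7 + 0.5(1) - 0.05(23) - 0.7(1) = -0.65, & p &= \frac{1}{1 + e^{0.65}} \approx 0.34
\end{aligned}
$$

So for the people who are actually eligible for treatment, the treatment lowers the probability of the outcome by roughly 0.17 at that age.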
So what happens when we make comparisons among the entire sample, which includes females and older people?
*Compare means with the original full sample.
T-TEST GROUPS=Treated(0 1) /VARIABLES=Outcome.
We get basically no difference: the treated mean is 0.40 and the untreated mean is 0.39. But instead of comparing the 165 treated cases to the entire sample, we can draw more reasonable control cases. Here we do an exact match on Male, and then a fuzzy match on YearsOld to within 3 years.
*Draw the comparison sample based on Male (exact) and YearsOld (Fuzzy).
FUZZY BY=Male YearsOld SUPPLIERID=Id NEWDEMANDERIDVARS=Match1 GROUP=Treated
EXACTPRIORITY=FALSE FUZZ=0 3 MATCHGROUPVAR=MGroup DRAWPOOLSIZE=CheckSize
/OPTIONS SAMPLEWITHREPLACEMENT=FALSE MINIMIZEMEMORY=TRUE SHUFFLE=TRUE SEED=10.
What the FUZZY command does is create a new variable, named Match1 here, that places the matched Id in the same row as the original treated case. You cannot easily make the updated comparisons you want in this data format though. So after writing the code to do this about 7 times, I decided to turn it into a simple macro. Here is an example of calling my macro, !MatchedSample.
*Now run my macro to make the matched sample.
!MatchedSample Dataset=OrigData Id=Id Case=Treated MatchGroup=MGroup Controls=[Match1]
MatchVars=[YearsOld] OthVars=Outcome Male.
This spits out two new datasets, and also appends a new variable named MatchedSample to the original dataset to show which cases have been matched. Then it is simple to see the difference in means among our matched sample.
*Now the t-test with the matched sample subset.
DATASET ACTIVATE MatchedSamples.
T-TEST GROUPS=Treated(0 1) /VARIABLES=Outcome.
This shows the same mean for the treated, 0.40 (since all of the treated cases were matched), but the comparison group now has a mean of 0.51. So in the matched sample the treatment reduced the outcome.
The macro also provides an additional dataset named AggStats that estimates the standardized bias in the original sample vs. the standardized bias in the matched sample. (Standardized bias is just Cohen’s d multiplied by 100.) It also calculates the standardized bias reduction for each continuous covariate. Before I forget, a neat way to test for balance jointly (instead of one variable at a time) is to estimate an additional regression equation predicting treatment from the covariates, and then test whether all of the coefficients are equal to zero.
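Written out, the standardized bias for a covariate x looks like the formula below. This uses the usual pooled standard deviation from Cohen’s d, which is my assumption here; the header of the macro documents exactly what it computes.

$$
SB = 100 \times \frac{\bar{x}_T - \bar{x}_C}{s_p}, \qquad s_p = \sqrt{\frac{(n_T - 1)s_T^2 + (n_C - 1)s_C^2}{n_T + n_C - 2}}
$$

Here T and C index the treated and comparison groups. A common way to express the bias reduction (again an assumption about the exact form) compares the absolute standardized bias before and after matching:

$$
\text{Reduction} = 100 \times \left(1 - \frac{|SB_{\text{matched}}|}{|SB_{\text{original}}|}\right)
$$

So a covariate that goes from a standardized bias of 40 in the original sample to 10 in the matched sample has a 75% reduction.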
In this fake example the propensity scores would not be needed; you could just estimate a typical logistic regression equation controlling for YearsOld and Male. But the utility of matching comes in when you don’t know the functional form of how those covariates affect the outcome. If the outcome were a very non-linear function of age, you would not have to worry about estimating that function. You can just match on age and still get a reasonable comparison of the mean difference for treated vs. untreated.
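For the joint balance check mentioned above, a minimal sketch is below. It assumes the MatchedSamples dataset carries the YearsOld variable along with Treated (check the macro header for what is actually returned). Male is left out because, with the exact match, everyone in the matched sample is male here, so it is a constant.

```
*Joint balance test in the matched sample (sketch).
*Assumes YearsOld is available in the MatchedSamples dataset.
DATASET ACTIVATE MatchedSamples.
LOGISTIC REGRESSION VARIABLES Treated
  /METHOD=ENTER YearsOld.
```

The Omnibus Tests of Model Coefficients table in the output gives the chi-square test that all of the coefficients are zero. A large p-value is what you hope to see if the matched sample is balanced. With only one covariate this is not much of a joint test, but with several matching variables you would just list them all after ENTER.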