Comparing samples post-matching – some helper functions after FUZZY (SPSS)

I’ve been conducting quite a few case-control or propensity score matching studies lately. So I wrote some helper functions for use after the SPSS FUZZY command. These create the case-control dataset, plus calculate some of the standardized bias metrics for matching on continuous outcomes.

The use case here is if you have a sub-set of treated individuals, and you want to draw a comparison sample matched on certain characteristics (which can include just one propensity score and/or multiple covariates). Here is the macro to follow along, and I will provide a quick walkthrough of how it works. (There is documentation in the header for what the parameters are and what the function returns.)

So first I am going to import my macro using INSERT:

*Inserting the macro.
INSERT FILE = "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Matching_StandBias\PropBalance_Macro.sps".

Now just for illustration I am going to make a fake dataset to illustrate the utility of matching. Here I have a universe of 2,000 people. There is a subset of treated individuals (165), but they are only selected if they are under 28 years old and male.

*Create a fake dataset.
SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 2000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME OrigData.
COMPUTE Male = RV.BERNOULLI(0.7).
COMPUTE YearsOld = RV.UNIFORM(18,40).
FORMATS Male (F1.0) YearsOld (F2.0).
DO IF Male = 1 AND YearsOld <= 28.
  COMPUTE Treated = RV.BERNOULLI(0.3).
ELSE.
  COMPUTE Treated = 0.
END IF.
COMPUTE #OutLogit = 0.7 + 0.5*Male - 0.05*YearsOld - 0.7*Treated.
COMPUTE #OutProb = 1/(1 + EXP(-#OutLogit)).
COMPUTE Outcome = RV.BERNOULLI(#OutProb).
FREQ Treated Outcome.

So what happens when we make comparisons among the entire sample, which includes females and older people?

*Compare means with the original full sample.
T-TEST GROUPS=Treated(0 1) /VARIABLES=Outcome.

We get basically no difference, our treated mean is 0.40 and the untreated mean is 0.39. But instead of comparing the 165 to the entire sample, we draw more reasonable control cases. Here we do an exact match on Male, and then we do a fuzzy match on YearsOld to within 3 years.

*Draw the comparison sample based on Male (exact) and YearsOld (Fuzzy).
FUZZY BY=Male YearsOld SUPPLIERID=Id NEWDEMANDERIDVARS=Match1 GROUP=Treated
    EXACTPRIORITY=FALSE FUZZ=0 3 MATCHGROUPVAR=MGroup DRAWPOOLSIZE=CheckSize
/OPTIONS SAMPLEWITHREPLACEMENT=FALSE MINIMIZEMEMORY=TRUE SHUFFLE=TRUE SEED=10.

Now what the FUZZY command does in SPSS is creates a new variable, named Match1 here, that places the matched Id in the same row as the original treated sample. You cannot easily make the updated comparisons that you want though in this data format. So after writing the code to do this about 7 times, I decided to make it into a simple macro. Here is an example of calling my macro, !MatchedSample.

*Now run my macro to make the matched sample.
!MatchedSample Dataset=OrigData Id=Id Case=Treated MatchGroup=MGroup Controls=[Match1] 
  MatchVars=[YearsOld] OthVars=Outcome Male.

This then spits out two new datasets, as well as appends a new variable to the original dataset named MatchedSample to show what cases have been matched. Then it is simple to see the difference in our means among our matched sample.

*Now the t-test with the matched sample subset.
DATASET ACTIVATE MatchedSamples.
T-TEST GROUPS=Treated(0 1) /VARIABLES=Outcome.

Which shows the same mean for treated, 0.40 (since all the treated were matched), but the comparison group now has a mean of 0.51, so here the treatment reduced the outcome.

The macro also provides an additional dataset named AggStats that estimates the standardized bias in the original sample vs. the standardized bias in the matched sample. (Standardized bias is just Cohen’s D measure multiplied by 100.) This then also calculates the standardized bias reduction for each continuous covariate. Before I forget, a neat way to test for balance jointly (instead of one variable at a time) is to conduct an additional regression equation predicting treatment and then testing for all coefficients equal to zero.

In this fake example the propensity scores would not be needed, you could just estimate a typical logistic regression equation controlling for YearsOld and Male. But the utility of matching comes from when you don’t know the functional form of how those covariates affect the outcome. So if the outcome was a very non-linear function of age, you don’t have to worry about estimating that function, you can just match on age and still get a reasonable comparison of the mean difference for treated vs. not-treated.

Fuzzy matching in SPSS using a custom python function

The other day I needed to conduct propensity score matching, but I was working with geographic data and wanted to restrict the matches to within a certain geographic distance. To do this I used the FUZZY extension command, which allows you to input a custom function. To illustrate I will be using some example data from my dissertation, and the code and data can be downloaded here.

So first lets grab the data and reduce it down a bit to only the variables we will be using. This dataset are street segments and intersections in DC, and the variables are crime, halfway houses, sidewalk cafes, and bars. Note to follow along you need to update the file handle to your machine.

FILE HANDLE save /NAME = "!!!Your Handle Here!!!".
GET FILE = "save\BaseData.sav".
DATASET NAME DC_Data.
SORT CASES BY MarID.

*Reduce the variable list down a bit.
MATCH FILES FILE = * /KEEP  MarID XMeters YMeters OffN1 OffN2 OffN3 OffN4 OffN5 OffN6 OffN7 OffN8 OffN9 
                            TotalCrime HalfwayHouse SidewalkCafe TypeC_D.

Now as a quick illustration, I am going to show a propensity score analysis predicting the location of halfway houses in DC – and see if street units with a halfway house are associated with more violence. Do not take this as a serious analysis, just as an illustration of the workflow. The frequency shows there are only 9 halfway houses in the city, and the compute statements collapse crimes into violent and non-violent. Then I use PLUM to fit the logistic model predicting the probability of treatment. I use non-violent crimes, sidewalk cafes, and bars as predictors.

FREQ HalfwayHouse.
COMPUTE Viol = OffN1 + OffN4 + OffN5 + OffN6.
COMPUTE NonViol = OffN2 + OffN3 + OffN7 + OffN8 + OffN9.

*Fitting logit model via PLUM.
PLUM HalfwayHouse WITH NonViol SidewalkCafe TypeC_D
  /CRITERIA=CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /LINK=LOGIT
  /PRINT=FIT PARAMETER SUMMARY
  /SAVE=ESTPROB.

The model is very bad, but we can see that sidewalk cafes are never associated with a halfway house! (Again this is just an illustration – don’t take this as a serious analysis of the effects of halfway houses on crime.) Now we need to make a custom function with which to restrict matches not only based on the probability of treatment, but also based on the geographic location. Here I made a file named DistFun.py, and placed in it the following functions:

#These functions are for SPSS's fuzzy case control matching
import math
#distance under 500, and caliper within 0.02
def DistFun(d,s):
  dx = math.pow(d[1] - s[1],2)  
  dy = math.pow(d[2] - s[2],2)  
  dis = math.sqrt(dx + dy)
  p = abs(d[0] - s[0])
  if dis < 500 and p < 0.02:
    t = 1
  else:
    t = 0
  return t
#distance over 500, but under 1500
def DistBuf(d,s):
  dx = math.pow(d[1] - s[1],2)  
  dy = math.pow(d[2] - s[2],2)  
  dis = math.sqrt(dx + dy)
  p = abs(d[0] - s[0])
  if dis  500 and p < 0.02:
    t = 1
  else:
    t = 0
  return t

The FUZZY command expects a function to return either a 1 for a match and 0 otherwise, and the function just takes a fixed set of vectors. The first function DistFun, takes a list where the first two elements are the coordinates, and the last element is the probability of treatment. It then calculates the euclidean distance, and returns a 1 if the distance is under 500 and the absolute distance in propensity scores is under 0.02. The second function is another example if you want matches not too close but not too far away, at a distance of between 500 and 1500. (In this dataset my coordinates are projected in meters.)

Now to make the research reproducible, what I do is save this python file, DistFun.py, in the same folder as the analysis. To make this an importable function in SPSS for FUZZY you need to do two things. 1) Also have the file __init__.py in the same folder (Jon Peck made the comment this is not necessary), and 2) add this folder to the system path. So back in SPSS we can add the folder to sys.path and check that our function is importable. (Note that this is not permanent change to the PATH system variable in windows, and is only active in the same SPSS session.)

*Testing out my custom function.
BEGIN PROGRAM Python.
import sys
sys.path.append("!!!Your\\Path\\Here!!!\\")

import DistFun

#test case
x = [0,0,0.02]
y = [0,499,0.02]
z = [0,500,0.02]
print DistFun.DistFun(x,y)
print DistFun.DistFun(x,z)
END PROGRAM.

Now we can use the FUZZY command and supply our custom function. Without the custom function you could specify the distance in any one dimension on the FUZZ command (e.g. here something like FUZZ = 0.02 500 500), but this produces a box, not a circle. Also with the custom function you can do more complicated things, like my second buffer function. The function takes the probability of treatment along with the two spatial coordinates of the street unit.

*This uses a custom function I made to restrict matches to within 500 meters.
FUZZY BY=EST2_1 XMeters YMeters SUPPLIERID=MarID NEWDEMANDERIDVARS=Match1 Match2 Match3 GROUP=HalfwayHouse CUSTOMFUZZ = "DistFun.DistFun"
    EXACTPRIORITY=FALSE  
MATCHGROUPVAR=MGroup 
/OPTIONS SAMPLEWITHREPLACEMENT=FALSE MINIMIZEMEMORY=TRUE SHUFFLE=TRUE SEED=10.

This takes less than a minute, and in this example provides a full set of matches for all 9 cases (not surprising, since the logistic regression equation predicting halfway house locations is awful). Now to conduct the propensity score analysis just takes alittle more data munging. Here I make a second data of just the matched locations, and then reshape the cases and controls so they are in long format. Then I merge the original data back in.

*Reshape, merge back in, and then conduct outcome analysis.
DATASET COPY PropMatch.
DATASET ACTIVATE PropMatch.
SELECT IF HalfwayHouse = 1.
VARSTOCASES /MAKE MarID FROM MarID Match1 Match2 Match3
            /INDEX Type
            /KEEP MGroup.

*Now remerge original data back in.
SORT CASES BY MarID.
MATCH FILES FILE = *
  /TABLE = 'DC_Data'
  /BY MarID. 

Now you can conduct the analysis. For example most people use t-tests both for the outcome and to assess balance on the pre-treatment variables.

*Now can do your tests.
T-TEST GROUPS=HalfwayHouse(0 1)
  /MISSING=ANALYSIS
  /VARIABLES=Viol
  /CRITERIA=CI(.95).

One of my next projects will be to use this workflow to conduct fuzzy name matching within and between police databases using custom string distance functions.