Turning SPSS data into Python data

Previously I blogged about how to take Python data and turn it back into SPSS data. Here we are going to do the opposite — turn SPSS data into Python objects. First to start out we will make a simple dataset of three variables.

DATA LIST Free /X Y (2F1.0) Z (A1). 
BEGIN DATA
1 2 A
4 5 B
7 8 C
END DATA.
DATASET NAME Test.
EXECUTE.

To import this data into Python, we need to import the spss class of functions, which then you can read cases from the active dataset using the Cursor attribute. Here is an example of grabbing all of the cases.

*Importing all of the data.
BEGIN PROGRAM Python.
import spss
dataCursor = spss.Cursor()
AllData = dataCursor.fetchall()
dataCursor.close()
print AllData
END PROGRAM.

What this then prints out is ((1.0, 2.0, 'A'), (4.0, 5.0, 'B'), (7.0, 8.0, 'C')), a set of nested tuples. You can also just grab one case by replacing dataCursor.fetchall() with dataCursor.fetchone(), in which case it will just return one tuple.

To only grab particular variables from the list, you can pass a set of indices in the spss.Cursor object. Remember, Python indices start at zero, so if you want the first and second variables in the dataset, you need to grab the 0 and 1 indices.

*Only grabbing certain variables.
BEGIN PROGRAM Python.
dataNum = spss.Cursor([0,1])
spNumbers = dataNum.fetchall()
dataNum.close()
print spNumbers
END PROGRAM.

This subsequently prints out ((1.0, 2.0), (4.0, 5.0), (7.0, 8.0)). When grabbing one variable, you may want just a list of the objects instead of the nested tuples. Here I use list comprehension to turn the resulting tuples for the Z variable into a nice list.

*Converting to a nice list.
BEGIN PROGRAM Python.
dataAlp = spss.Cursor([2])
spAlp = dataAlp.fetchall()
dataAlp.close()
spAlp_list = [i[0] for i in spAlp] #convert to nice list
print spAlp
print spAlp_list
END PROGRAM.

The first print object is (('A',), ('B',), ('C',)), but the second is ['A', 'B', 'C'].

The above code works fine if you know the position of the variable in the file, but if the position can change this won’t work. Here is a one liner to get the variable names of the active dataset and plop them in a list.

*Way to get SPSS variable names.
BEGIN PROGRAM Python.
varList = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
print varList
END PROGRAM.

Now if you have your list of variable names you want, you can figure out the index value. There are two ways to do it, iterate over the list of variable names in the dataset, or iterate over the list of your specified variables. I do the latter here (note this will result in an error if you supply a variable name not in the dataset).

*Find the indices of specific variables.
BEGIN PROGRAM Python.
LookVars = ["X","Z"]
VarInd = [varList.index(i) for i in LookVars]
print VarInd
END PROGRAM.

Now you can just supply VarInd above to the argument for spss.Cursor to grab those variables. Here I wrapped it all up in a function.

*Easy function to use.
BEGIN PROGRAM Python.
import spss
def AllSPSSdat(vars):
  if vars == None:
    varNums = range(spss.GetVariableCount())
  else:
    allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
    varNums = [allvars.index(i) for i in vars]
  data = spss.Cursor(varNums)
  pydata = data.fetchall()
  data.close()
  return pydata
END PROGRAM.

You can either supply a list of variables or None, in the latter case all of the variables are returned.

BEGIN PROGRAM Python.
MyDat = AllSPSSdat(vars=["Y","Z"])
print MyDat
END PROGRAM.

This set of nested tuples is then pretty easy to convert to other Python objects. Panda’s dataframes, Numpy arrays, and NetworkX objects are all one liners. Here is turning the entire dataset into a panda’s data frame.

*Turn into pandas data frame.
BEGIN PROGRAM Python.
import pandas as pd
MyDat = AllSPSSdat(vars=None)
allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
PanDat = pd.DataFrame(list(MyDat),columns=allvars)
print PanDat
END PROGRAM.

Which prints out.

   X  Y  Z 
0  1  2  A 
1  4  5  B 
2  7  8  C

Using Python to grab Google Street View imagery

I am at it again with using Google data. For a few projects I was interested in downloading street view imagery data. It has been used in criminal justice applications as a free source for second hand systematic social observation by having people code aspects of disorder from the imagery (instead of going in person) (Quinn et al., 2014), as estimates of the ambient walking around population (Yin et al., 2015), and examining criminogenic aspects of the built environment (Vandeviver, 2014).

I think it is just a cool source of data though to be honest. See for example Phil Cohen’s Family Inequality post in which he shows examples of auctioned houses in Detroit over time.

Using the Google Street View image API you can submit either a set of coordinates or an address and have the latest street view image returned locally. This ends up being abit simpler than my prior examples (such as the street distance API or the places API) because it just returns the image blob, no need to parse JSON.

Below is a simple example in python, using a set of addresses in Detroit that are part of a land bank. This function takes an address and a location to download the file, then saves the resulting jpeg to your folder of choice. I defaulted for the image to be 1200×800 pixels.

import urllib, os

myloc = r"C:\Users\andrew.wheeler\Dropbox\Public\ExampleStreetView" #replace with your own location
key = "&key=" + "" #got banned after ~100 requests with no key

def GetStreet(Add,SaveLoc):
  base = "https://maps.googleapis.com/maps/api/streetview?size=1200x800&location="
  MyUrl = base + urllib.quote_plus(Add) + key #added url encoding
  fi = Add + ".jpg"
  urllib.urlretrieve(MyUrl, os.path.join(SaveLoc,fi))

Tests = ["457 West Robinwood Street, Detroit, Michigan 48203",
         "1520 West Philadelphia, Detroit, Michigan 48206",
         "2292 Grand, Detroit, Michigan 48238",
         "15414 Wabash Street, Detroit, Michigan 48238",
         "15867 Log Cabin, Detroit, Michigan 48238",
         "3317 Cody Street, Detroit, Michigan 48212",
         "14214 Arlington Street, Detroit, Michigan 48212"]

for i in Tests:
  GetStreet(Add=i,SaveLoc=myloc)

Dropbox has a nice mosaic view for a folder of pictures, you can view all seven photos here. Here is the 457 West Robinwood Street picture:

In my tests my IP got banned after around 100 images, but you can get a verified google account which allows 25,000 image downloads per day. Unfortunately the automatic API only returns the most recent image – there is no way to return older imagery nor know the date-stamp of the current image. (You technically could download the historical data if you know the pano id for the image. I don’t see any way though to know the available pano id’s though.) Update — as of 2018 there is now a Date associated with the image, specifically a Year-Month, but no more specific than that. Not being able to figure out historical pano id’s is still a problem as far as I can tell as well.

But this is definitely easier for social scientists wishing to code images as opposed to going into the online maps. Hopefully the API gets extended to have dates and a second API to return info. on what image dates are available. I’m not sure if Mike Bader’s software app is actually in the works, but for computer scientists there is a potential overlap with social scientists to do feature extraction of various social characteristics, in addition to manual coding of the images.


Update: here is a version that works for python 3+. Currently you need to have a key, no more getting a few free images before being cut off.

# For Python Versions 3+
# Tested V3.10 on 12/15/2021
import os
import urllib.parse
import urllib.request

myloc = r"??????????????" #replace with your own location
key = "&key=" + "????????????" #you need an actual key now!!

def GetStreet(Add,SaveLoc):
  base = "https://maps.googleapis.com/maps/api/streetview?size=1200x800&location="
  MyUrl = base + urllib.parse.quote_plus(Add) + key #added url encoding
  fi = Add + ".jpg"
  urllib.request.urlretrieve(MyUrl, os.path.join(SaveLoc,fi))

Tests = ["457 West Robinwood Street, Detroit, Michigan 48203",
         "1520 West Philadelphia, Detroit, Michigan 48206",
         "2292 Grand, Detroit, Michigan 48238",
         "15414 Wabash Street, Detroit, Michigan 48238",
         "15867 Log Cabin, Detroit, Michigan 48238",
         "3317 Cody Street, Detroit, Michigan 48212",
         "14214 Arlington Street, Detroit, Michigan 48212"]

for i in Tests:
  GetStreet(Add=i,SaveLoc=myloc)

Using KDTree’s in python to calculate neighbor counts

For a few different projects I’ve had to take a set of crime data and calculate the number of events nearby. It is a regular geospatial task, counting events in a particular buffer, but one that can be quite cumbersome if you have quite a few points to cross-reference. For one project I needed to calculate this for 4,500 street segments and intersections against a set of over 100,000 crime incidents. While this is not big data, if you go the brute force route and simply calculate all pair-wise distances, this results in 450 million comparisons. Since all I wanted was the number of counts within a particular distance buffer, KDTree’s offer a much more efficient search solution.

Here I give an example in Python using numpy and the nearest neighbor algorithms available in SciPy. This example is calculating the number of shootings in DC within 1 kilometer of schools. The shooting data is sensor data via ShotSpotter, and is publicly available at the new open data site. The school data I calculated the centroids based on school area polygons, available from the legacy DC data site. This particular example was motivated by this Urban Institute post.

There are around 170 schools and under 40,000 total shootings, so this would not be a big issue to calculate all the pairwise distances. It is a simple and quick example though. Here I have the Python code and CSV files are available to download.

So first we will just import the spatial data into numpy arrays. Note each file has the coordinates in projected meters.

import numpy as np
#getting schools and shootings from CSV files
schools = np.genfromtxt('DC_Schools.csv', delimiter=",", skip_header=1)
shootings = np.genfromtxt('ShotSpotter.csv', delimiter=",", skip_header=1, usecols=(4,5))

Next we can import the KDTree function, and then supply it with the two spatial coordinates for the shootings. While this is a relatively small file, even for my example larger set of crime data with over 100,000 incidents building the tree was really fast, less than a minute.

#making KDTree, and then searching within 1 kilometer of school
from sklearn.neighbors import KDTree
shoot_tree = KDTree(shootings)

Finally you can then search within a particular radius. You can either search one location at a time, but here I do a batch search and count the number of shootings that are within 1,000 meters from each school. Again this is a small example, but even with my 4,500 locations against 100,000 crimes it was very fast. (Whereas my initial SPSS code to do all 450 million pairwise combinations was taking all night.)

shoot_tree.query_radius(schools[:,[1,2]],r=1000,count_only=True)

Which produces an array of the counts for each of the schools:

array([1168,  179, 2384,  686,  583, 1475,  239, 1890, 2070,  990,   74,
        492,   10,  226, 2057, 1813, 1137,  785,  742, 1893, 4650, 1910,
          0,  926, 2380,  818, 2868, 1230,    0, 3230, 1103, 2253, 4531,
       1548,    0,    0,    0, 1002, 1912,    0, 1543,    0,  580,    4,
       1224,  843,  212,  591,    0,    0, 1127, 4520, 2283, 1413, 3255,
        926,  972,  435, 2734,    0, 2828,  724,  796,    1, 2016, 2540,
        369, 1903, 2216, 1697,  155, 2337,  496,  258,  900, 2278, 3123,
        794, 2312,  636, 1663,    0, 4896,  604, 1141,    7,    0, 1803,
          2,  283,  270, 1395, 2087, 3414,  143,  238,  517,  238, 2017,
       2857, 1100, 2575,  179,  876, 2175,  450, 5230, 2441,   10, 2547,
        202, 2659, 4006, 2514,  923,    6, 1481,    0, 2415, 2058, 2309,
          3, 1132,    0, 1363, 4701,  158,  410, 2884,  979, 2053, 1120,
       2048, 2750,  632, 1492, 1670,  242,  131,  863, 1246, 1361,  843,
       3567, 1311,  107, 1501,    0, 2176,  296,  803,    0, 1301,  719,
         97, 2312,  150,  475,  764, 2078,  912, 2943, 1453,  178,  177,
        389,  244,  874,  576,  699,  295,  843,  274])

And that is it! Quite simple.

The ShotSpotter data is interesting, with quite a few oddities that are worth exploring. I recently wrote a chapter on SPSS’s geospatial tools for an upcoming book on advanced SPSS features, providing a predictive police example predicting weekly shootings using the new geospatial temporal modelling tool. The book is edited by Keith McCormick and Jesus Salcedo, and I will write a blog post whenever it comes out highlighting what the chapter goes over.

Using local Python objects in SPSSINC TRANS – examples with network statistics

When using SPSSINC TRANS, you have a wider array of functions to compute on cases in SPSS. Within the local session, you can create your own python functions within a BEGIN PROGRAM and END PROGRAM block. In SPSSINC TRANS you pass in the values in the current dataset, but you can also create functions that use data in the local python environment as well. An example use case follows in which you create a network in the local python environment using SPSS data, and then calculate several network statistics on the nodes. Here is a simple hierarchical network dataset that signifies managers and subordinates in an agency.

*Edge list. 
DATA LIST FREE / Man Sub (2F1.0). 
BEGIN DATA 
1 2 
2 3 
2 4 
3 5 
3 6 
4 7 
4 8 
END DATA. 
DATASET NAME Boss. 

We can subsequently turn this into a NetworkX graph with the code below. Some of my prior SPSS examples using NetworkX had a bit more complicated code using loops and turning the SPSS dataset into the network object. But actually the way SPSS dumps the data in python (as a tuples nested within a list) is how the add_edges_from function expects it in NetworkX, so no looping required (and it automatically creates the nodes list from the edge data).

BEGIN PROGRAM Python. 
import networkx as nx
import spss, spssdata

alldata = spssdata.Spssdata().fetchall()  #get SPSS data
G = nx.DiGraph()                          #create empty graph
G.add_edges_from(alldata)                 #add edges into graph
print G.nodes()
END PROGRAM.

Note now that we have the graph object G in the local python environment for this particular SPSS session. We can then make our own functions that references G, but takes other inputs. Here I have examples for the geodesic distance between two nodes, closeness and degree centrality, and the average degree of the neighbors.

BEGIN PROGRAM Python.
#path distance
def geo_dist(source,target): 
  return nx.shortest_path_length(G,source,target)
#closeness centrality
def close_cent(v):
  return nx.closeness_centrality(G,v)
#degree
def deg(v):
  return G.degree(v)
#average degree of neighbors
def avg_neideg(v):
  return nx.average_neighbor_degree(G,nodes=[v])[v]
END PROGRAM.

Here is the node list in a second SPSS dataset that we will calculate the mentioned statistics on. For large graphs, this is nice because you can select out a smaller subset of nodes and only worry about the calculations on that subset. For a crime analysis example, I may be monitoring a particular set of chronic offenders, and I want to calculate how close every arrested person within the month is to the set of chronic offenders.

DATA LIST FREE / Employ (F1.0). 
BEGIN DATA 
1 
2
3
4
5
6
7
8
END DATA. 
DATASET NAME Emp. 
DATASET ACTIVATE Emp.

Now we have all the necessary ingredients to calculate our network statistics on these nodes. Here are examples of using SPSSINC TRANS to calculate the network statistics in the local SPSS dataset.

*Geodesic distance from 1.
SPSSINC TRANS RESULT=Dist TYPE=0
  /FORMULA "geo_dist(source=1.0,target=Employ)".

*closeness centrality.
SPSSINC TRANS RESULT=Cent TYPE=0
  /FORMULA "close_cent(v=Employ)".

*degree.
SPSSINC TRANS RESULT=Deg TYPE=0
  /FORMULA "deg(v=Employ)".

*Average neighbor degree.
SPSSINC TRANS RESULT=NeighDeg TYPE=0
  /FORMULA "avg_neideg(v=Employ)".

Laplacian Centrality in NetworkX (Python)

The other day I read a few papers on a new algorithm for calculating centrality in networks. Below are the two papers describing the Laplacian Centrality metric. The first is for non-weighted networks, and the second for weighted networks.

  • Qi, X., Duval, R. D., Christensen, K., Fuller, E., Spahiu, A., Wu, Q., Wu, Y., Tang, W., and Zhang, C. (2013). Terrorist networks, network energy and node removal: A new measure of centrality based on laplacian energy. Social Networking, 02(01):19-31.
  • Qi, X., Fuller, E., Wu, Q., Wu, Y., and Zhang, C.-Q. (2012). Laplacian centrality: A new centrality measure for weighted networks. Information Sciences, 194:240-253. PDF Here.

The metric is fairly intuitive I think. The centrality parameter is a function of the local degree plus the degree’s of the neighbors (with different weights for each). I figured it would be a quick programming exercise (which means I spent way too long trying to implement it!). To follow is some code that replicates the measures for both weighted and non-weighted graphs, using the Python networkx library.

The non-weighted graph code is easy, and is a near copy-paste from some igraph code snippet that was already available. Just some updates to idiom’s for NetworkX specifically. The norm option specifies whether you want solely the numerator value, the difference between the energy in the full graph versus the graph with the node removed (norm=False), or whether you want to divide this value by the energy for the full graph. Note this function ignores the weights in the graph. nbunch is if you want to pass the function only a subset of points to calculate the centrality. (If you do that you might as well have norm=False for time savings as well.)

def lap_cent(graph, nbunch=None, norm=False):
  if nbunch is None:
    vs = graph.nodes()
  else:
    vs = nbunch
  degrees = graph.degree(weight=None)
  if norm is True:
    den = sum(v**2 + v for i,v in degrees.items())
    den = float(den)
  else:
    den = 1
  result = []
  for v in vs:
    neis = graph.neighbors(v)
    loc = degrees[v]
    nei = 2*sum(degrees[i] for i in neis)
    val = (loc**2 + loc + nei)/den
    result.append(val)
  return result

The weighted network is a bit more tricky though. I thought coding all of the two walks seemed a royal pain, so I developed a different algorithm that I believe is quicker. Here are three functions, but the last one is the one of interest, lap_cent_weighted. The options are similar to the unweighted version, with the exception that you can pass a weight attribute (which is by default named ‘weight’ in NetworkX graphs).

def lap_energy(graph, weight='weight'):
  degrees = graph.degree(weight=weight)
  d1 = sum(v**2 for i,v in degrees.items())
  wl = 0
  for i in graph.edges(data=True):
    wl += (i[2].get(weight))**2
  return [d1,2*wl]

def cw(graph,node,weight='weight'):
  neis = graph.neighbors(node)
  ne = graph[node]
  cw,sub = 0,0
  for i in neis:
    we = ne[i].get(weight)
    od = graph.degree(i,weight=weight)
    sub += -od**2 + (od - we)**2
    cw += we**2
  return [cw,sub]

def lap_cent_weighted(graph, nbunch=None, norm=False, weight='weight'):
  if nbunch is None:
    vs = graph.nodes()
  else:
    vs = nbunch
  if norm == True:
    fe = lap_energy(graph,weight=weight)
    den = float(fe[0]+fe[1])
  else:
    den = 1
  result = []
  for i in vs:
     d2 = graph.degree(i,weight=weight)
     w2 = cw(graph,i,weight=weight)
     fin = d2**2 - w2[1] + 2*w2[0]
     result.append(fin/den)
  return result

For a brief overview of the new algorithm (in some quick and dirty text math), to define the energy of the entire graph it is:

sum(di^2) + 2*sum(wij^2)   (1)

Where di are the degrees for all i nodes, and the second term is 2 times the sum of the weights squared. So when you take out a particular node, say ‘A’, the drop in the second term is easy, just iterate over the neighbors of ‘A’, and calculate 2*sum(waj^2), then subtract that from the second term in equation 1.

The first term is slightly more complex. First there is a decrease due to simply the degree of the removed node, da^2. There is also a decrease in the degree on the neighboring nodes as well, so you need to calculate their updated contribution. The necessary info. is available when you iterate over the neighbor list though, and if the original contribution is di^2, and the weight of wia, then the updated weight is -di^2 + (di - wia)^2. You can calculate this term at the same time you calculate the decrease in the weights in the second term.

I believe this algorithm is faster than the one originally written in the second paper. It should be something like O(n*a), where a is the average number of neighbors for the entire graph, and n are the number of nodes. Or in worst case a is the maximum number of neighbors any node has in the graph (which should be less than or equal to the max degree in a weighted graph).

Here is an example of using the functions with the small, toy example in the weighted network paper. Note that the lap_cent function ignores the weights.

import networkx as nx

Gp = nx.Graph()
ed = [('A','B',4),('A','C',2),('C','B',1),('B','D',2),('B','E',2),('E','F',1)]
Gp.add_weighted_edges_from(ed)

x = lap_cent(Gp)
xw = lap_cent_weighted(Gp, norm=True)
for a,b,c in zip(Gp.nodes(),x,xw):
  print a,b,c

Which prints out at the console:

A 18 0.7 
C 18 0.28 
B 34 0.9 
E 16 0.26 
D 10 0.22 
F 6 0.04

If you want to see that the graph is the same as the graph in the weighted Qi paper, use below.

import matplotlib.pyplot as plt
pos=nx.spring_layout(Gp) # positions for all nodes
nx.draw(Gp,pos=pos)
nx.draw_networkx_labels(Gp,pos=pos)
nx.draw_networkx_edge_labels(Gp,pos=pos)
plt.show()

Fuzzy matching in SPSS using a custom python function

The other day I needed to conduct propensity score matching, but I was working with geographic data and wanted to restrict the matches to within a certain geographic distance. To do this I used the FUZZY extension command, which allows you to input a custom function. To illustrate I will be using some example data from my dissertation, and the code and data can be downloaded here.

So first lets grab the data and reduce it down a bit to only the variables we will be using. This dataset are street segments and intersections in DC, and the variables are crime, halfway houses, sidewalk cafes, and bars. Note to follow along you need to update the file handle to your machine.

FILE HANDLE save /NAME = "!!!Your Handle Here!!!".
GET FILE = "save\BaseData.sav".
DATASET NAME DC_Data.
SORT CASES BY MarID.

*Reduce the variable list down a bit.
MATCH FILES FILE = * /KEEP  MarID XMeters YMeters OffN1 OffN2 OffN3 OffN4 OffN5 OffN6 OffN7 OffN8 OffN9 
                            TotalCrime HalfwayHouse SidewalkCafe TypeC_D.

Now as a quick illustration, I am going to show a propensity score analysis predicting the location of halfway houses in DC – and see if street units with a halfway house are associated with more violence. Do not take this as a serious analysis, just as an illustration of the workflow. The frequency shows there are only 9 halfway houses in the city, and the compute statements collapse crimes into violent and non-violent. Then I use PLUM to fit the logistic model predicting the probability of treatment. I use non-violent crimes, sidewalk cafes, and bars as predictors.

FREQ HalfwayHouse.
COMPUTE Viol = OffN1 + OffN4 + OffN5 + OffN6.
COMPUTE NonViol = OffN2 + OffN3 + OffN7 + OffN8 + OffN9.

*Fitting logit model via PLUM.
PLUM HalfwayHouse WITH NonViol SidewalkCafe TypeC_D
  /CRITERIA=CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /LINK=LOGIT
  /PRINT=FIT PARAMETER SUMMARY
  /SAVE=ESTPROB.

The model is very bad, but we can see that sidewalk cafes are never associated with a halfway house! (Again this is just an illustration – don’t take this as a serious analysis of the effects of halfway houses on crime.) Now we need to make a custom function with which to restrict matches not only based on the probability of treatment, but also based on the geographic location. Here I made a file named DistFun.py, and placed in it the following functions:

#These functions are for SPSS's fuzzy case control matching
import math
#distance under 500, and caliper within 0.02
def DistFun(d,s):
  dx = math.pow(d[1] - s[1],2)  
  dy = math.pow(d[2] - s[2],2)  
  dis = math.sqrt(dx + dy)
  p = abs(d[0] - s[0])
  if dis < 500 and p < 0.02:
    t = 1
  else:
    t = 0
  return t
#distance over 500, but under 1500
def DistBuf(d,s):
  dx = math.pow(d[1] - s[1],2)  
  dy = math.pow(d[2] - s[2],2)  
  dis = math.sqrt(dx + dy)
  p = abs(d[0] - s[0])
  if dis  500 and p < 0.02:
    t = 1
  else:
    t = 0
  return t

The FUZZY command expects a function to return either a 1 for a match and 0 otherwise, and the function just takes a fixed set of vectors. The first function DistFun, takes a list where the first two elements are the coordinates, and the last element is the probability of treatment. It then calculates the euclidean distance, and returns a 1 if the distance is under 500 and the absolute distance in propensity scores is under 0.02. The second function is another example if you want matches not too close but not too far away, at a distance of between 500 and 1500. (In this dataset my coordinates are projected in meters.)

Now to make the research reproducible, what I do is save this python file, DistFun.py, in the same folder as the analysis. To make this an importable function in SPSS for FUZZY you need to do two things. 1) Also have the file __init__.py in the same folder (Jon Peck made the comment this is not necessary), and 2) add this folder to the system path. So back in SPSS we can add the folder to sys.path and check that our function is importable. (Note that this is not permanent change to the PATH system variable in windows, and is only active in the same SPSS session.)

*Testing out my custom function.
BEGIN PROGRAM Python.
import sys
sys.path.append("!!!Your\\Path\\Here!!!\\")

import DistFun

#test case
x = [0,0,0.02]
y = [0,499,0.02]
z = [0,500,0.02]
print DistFun.DistFun(x,y)
print DistFun.DistFun(x,z)
END PROGRAM.

Now we can use the FUZZY command and supply our custom function. Without the custom function you could specify the distance in any one dimension on the FUZZ command (e.g. here something like FUZZ = 0.02 500 500), but this produces a box, not a circle. Also with the custom function you can do more complicated things, like my second buffer function. The function takes the probability of treatment along with the two spatial coordinates of the street unit.

*This uses a custom function I made to restrict matches to within 500 meters.
FUZZY BY=EST2_1 XMeters YMeters SUPPLIERID=MarID NEWDEMANDERIDVARS=Match1 Match2 Match3 GROUP=HalfwayHouse CUSTOMFUZZ = "DistFun.DistFun"
    EXACTPRIORITY=FALSE  
MATCHGROUPVAR=MGroup 
/OPTIONS SAMPLEWITHREPLACEMENT=FALSE MINIMIZEMEMORY=TRUE SHUFFLE=TRUE SEED=10.

This takes less than a minute, and in this example provides a full set of matches for all 9 cases (not surprising, since the logistic regression equation predicting halfway house locations is awful). Now to conduct the propensity score analysis just takes alittle more data munging. Here I make a second data of just the matched locations, and then reshape the cases and controls so they are in long format. Then I merge the original data back in.

*Reshape, merge back in, and then conduct outcome analysis.
DATASET COPY PropMatch.
DATASET ACTIVATE PropMatch.
SELECT IF HalfwayHouse = 1.
VARSTOCASES /MAKE MarID FROM MarID Match1 Match2 Match3
            /INDEX Type
            /KEEP MGroup.

*Now remerge original data back in.
SORT CASES BY MarID.
MATCH FILES FILE = *
  /TABLE = 'DC_Data'
  /BY MarID. 

Now you can conduct the analysis. For example most people use t-tests both for the outcome and to assess balance on the pre-treatment variables.

*Now can do your tests.
T-TEST GROUPS=HalfwayHouse(0 1)
  /MISSING=ANALYSIS
  /VARIABLES=Viol
  /CRITERIA=CI(.95).

One of my next projects will be to use this workflow to conduct fuzzy name matching within and between police databases using custom string distance functions.

Passing arguments to SPSSINC TRANS (2)

Jon Peck made some great comments on my prior post on passing arguments to the SPSSINC TRANS function. Besides advice on that I should be quoting the argument on the FORMULA statement, he gave examples of how you can use the "TO" argument in both passing variables lists within the python formula and assigning variables to the results. Here is a brief example of their use.

First I will be working with a tiny, toy dataset:

DATA LIST FREE / X1 TO X4.
BEGIN DATA
1 2 3 4
5 6 7 8
9 8 7 6
5 4 3 2
END DATA.
DATASET NAME Test.

Now here is a command that returns the second lowest value in a list. (While there are plenty of things you can do in base code, this python code is very simple compared to what you would have to do in vanilla SPSS to figure this out.) In a nutshell you can specify the variable list on the /VARIABLE subcommand (and mix in TO to specify adjacent variables as in most SPSS commands). And then insert these into the python formula by specifying <>.

SPSSINC TRANS RESULT = Second
  /VARIABLES X1 TO X4
  /FORMULA "sorted([<>])[1]".

In my prior post, I showed how you could do this for the original variables, which would look like /FORMULA "sorted([X1,X2,X3,X4])[1]". Here you can see I’ve specified a set of variables on the VARIABLES subcommand, and inserted them into a list using [<>]. Enclosing <> in brackets produces a list in python. I then sort the list and grab the second element (located at 1, since python uses 0 based indices). You can also mix variables in the dataset and the <> listed on the variables subcommand. See here for an example.

You can also use the TO modifier in making a new set of variables. Here I return the sorted variables X1 TO X4 as a new set of variables S1 TO S4.

SPSSINC TRANS RESULT = S1 TO S4
  /VARIABLES X1 TO X4
  /FORMULA "sorted([<>])".

In both the prior examples I omitted the TYPE argument, as it defaults to 0 (i.e. a numeric variable returned as a float). But when specifying variable lists of the same type for multiple variables you can simply specify the type one time and the rest of the results are intelligently returned as the same. Here is the same sorted example, except that I return the results each as a string of length 1 as opposed to a numeric value.

SPSSINC TRANS RESULT = A1 TO A4 TYPE = 1
  /VARIABLES X1 TO X4 
  /FORMULA "map(str, sorted([<>]))".

Passing arguments to SPSSINC TRANS

So I actually bothered to read the help the other day for SPSSINC TRANS, which being generic allows you to use Python functions similar to how COMPUTE statements work, just a bit more general. Two examples of passing arguments I did not know you could do were 1) pass a list as an argument, and 2) pass constants that aren’t SPSS variables to functions. To follow are a few brief examples.

The first is passing a list to a function, and here is a simple example using the Python function sorted().

DATA LIST FREE / X1 X2 X3.
BEGIN DATA
3 2 1
1 0 3
1 1 2
2 2 1
3 0 3
END DATA.
DATASET NAME Test.

SPSSINC TRANS RESULT=S1 S2 S3 TYPE=0
  /FORMULA sorted([X1,X2,X3]).

This takes the variables X1 to X3, sorts them, and returns them in a new set of variables S1 to S3. We can also do reverse sorting by passing a constant value of 1 to the reverse function, which acts synonymously with reverse=True.

SPSSINC TRANS RESULT=RS1 RS2 RS3 TYPE=0
  /FORMULA sorted([X1,X2,X3],reverse=1).

This is a rather simplistic example, but the action is much simpler in Python than whatever equivalent SPSS code you can come up with. When using the SPSSINC TRANS extension it expects the returned function to simply be a flat list. For this sorting situation though it might be convenient to return the order in which the original value was stored. Here I make a function that returns the indice of the original list, and then flattens the two into sequential order, per this SO answer.

BEGIN PROGRAM Python.
import itertools

def SortList(L,reverse=0):
  I = range(1,len(L)+1)
  x = sorted(zip(L,I),reverse=reverse)
  r = list(itertools.chain.from_iterable(x))
  return r

#example use
print SortList(L=[2,1,3])
print SortList(L=[2,1,3],reverse=1)
END PROGRAM.

MATCH FILES FILE = * /DROP S1 TO RS3.

SPSSINC TRANS RESULT= S1 T1 S2 T2 S3 T3 TYPE=0
  /FORMULA SortList([X1,X2,X3],reverse=1).

When passing a string constant to a function in SPSSINC TRANS you need to triple quote the string. This makes some of my prior examples of using the Google maps related API’s much simpler. Instead of making variables to pass to the function, you can just triple quote the constants. Also when using the maps API I often have an argument for the API key, but you will get results even without a key (I presume Google just checks the IP address an limits you after so many requests). So for many of my functions you can not worry about making an API key and just pass an empty string. Here is an example from my prior Google distance API post using string constants and no API key.

BEGIN PROGRAM Python.
import urllib, json

#This parses the returned json to pull out the distance in meters and
#duration in seconds, [None,None] is returned is status is not OK
def ExtJsonDist(place):
  if place['rows'][0]['elements'][0]['status'] == 'OK':
    meters = place['rows'][0]['elements'][0]['distance']['value']
    seconds = place['rows'][0]['elements'][0]['duration']['value']
  else:
    meters,seconds = None,None
  return [meters,seconds]

#Takes a set of lon-lat coordinates for origin and destination,
#plus your API key and returns the json from the distance API
def GoogDist(OriginX,OriginY,DestinationX,DestinationY,key):
  MyUrl = ('https://maps.googleapis.com/maps/api/distancematrix/json'
           '?origins=%s,%s'
           '&destinations=%s,%s'
           '&key=%s') % (OriginY,OriginX,DestinationY,DestinationX,key)
  response = urllib.urlopen(MyUrl)
  jsonRaw = response.read()
  jsonData = json.loads(jsonRaw)
  data = ExtJsonDist(jsonData)
  return data
END PROGRAM.

*Grab the online data.
DATASET CLOSE ALL.
SPSSINC GETURI DATA
URI="https://dl.dropboxusercontent.com/u/3385251/NewYork_ZipCentroids.sav"
FILETYPE=SAV DATASET=NY_Zips.

*Selecting out only a few.
SELECT IF $casenum <= 5.
EXECUTE.

SPSSINC TRANS RESULT=Meters Seconds TYPE=0 0 
/FORMULA GoogDist(OriginX=LongCent,OriginY=LatCent,DestinationX='''-78.276205''',DestinationY='''42.850721''',key=''' ''').

Extracting items from SPSS tables using Python

Sometimes there are calculations provided for in SPSS tables that are necessary to use for other calculations. A frequent one is to grab certain percentiles from a FREQUENCY table (Equal Probability Histograms in SPSS is one example). The typical way to do this is to grab the table using OMS, but where that is overkill is if you need to merge this info. back into the original data for further calculations. I will show a brief example of grabbing the 25th, 50th, and 75th percentiles from a frequency table and using Python to calculate a robust standardized variable using these summary statistics.

First we will make a set of random data to work with.

SET SEED 10.
MATRIX.
SAVE {UNIFORM(100,1)} /OUTFILE = *.
END MATRIX.
DATASET NAME U.
FREQ COL1 /FORMAT = NOTABLE /PERCENTILES = 25 50 75.

The frequency table we are working with then looks like:

Now to get to the items in this frequency table we just to do a bit of going down a rabbit hole of different python objects.

  • The first block grabs the items in the output, which include tables and text.
  • The second block then grabs the last table for this specific output. Note that minus 2 from the size of the list is necessary because Python uses zero based indices and there is a log item after the table. So if the size of the list is 10, that means list[9] is the last item in the list. (Using negative indices does not work for extracting from the OutputItemList object.)
  • The third part then grabs the quantiles from the indices of the table. It ends up being in the first data column (so zero) and in the 3rd, 4th and 5th rows (again, Python uses zero based indices). Using GetUnformattedValueAt grabs the floating point number.
  • The final part then uses these quantiles to calculate a robust normalized variable by using spss.Submit and string substitution. (And then closes the SPSS client at the end.)

BEGIN PROGRAM Python.
import SpssClient, spss

#start the client, grab the items in the output
SpssClient.StartClient()
OutputDoc = SpssClient.GetDesignatedOutputDoc()
OutputItemList = OutputDoc.GetOutputItems()

#Grab the last table, 0 based index
lastTab = OutputItemList.Size() - 2
OutputItem = OutputItemList.GetItemAt(lastTab)
PivotTable = OutputItem.GetSpecificType()
SpssDataCells = PivotTable.DataCellArray()

#Grab the specific quantiles
Q25 = float(SpssDataCells.GetUnformattedValueAt(2,0))
Q50 = float(SpssDataCells.GetUnformattedValueAt(3,0))
Q75 = float(SpssDataCells.GetUnformattedValueAt(4,0))
print [Q25,Q50,Q75]

#Use these stats in SPSS commands
spss.Submit("COMPUTE QuantNormX = ( COL1 - %(Q50)f )/( %(Q75)f - %(Q25)f )." % locals())
SpssClient.StopClient()
END PROGRAM.

While the python code in terms of complexity is just about the same as using OMS to grab the frequency table and merge the quantiles back into the original data, this will be much more efficient. I can imagine using this for other projects too, like grabbing coefficients from a regression model and estimating certain marginal effects.

Using the New York State Online Geocoding API with Python

I’ve been very lucky doing geographic analysis in New York state, as the majority of base map layers I need, and in particular streets centerline files for geocoding, are available statewide at the NYS GIS Clearing house. I’ve written in the past how to use various Google API’s for geo data, and here I will show how one can use the NYS SAM Address database and their ESRI online geocoding service. I explored this because Google’s terms of service are restrictive, and the NYS composite locator should be more comprehensive/up to date in matches (in theory).

So first, this is basically the same as with most online API’s (at least in my limited experience), submit a particular url and get JSON in return. You just then need to parse the JSON for whatever info you need. This is meant to be used within SPSS, but the function works with just a single field address string and returns the single top hit in a list of length 3, with the unicode string address, and then the x and y coordinates. (The function is of course a valid python function, so you could use this in any environment you want.) The coordinates are specified using ESRI’s WKID (see the list for projected and geographic coordinate systems). In the code I have it fixed as WKID 4326, which is WGS 1984, and so returns the longitude and latitude for the address. When the search returns no hits, it just returns a list of [None,None,None].

*Function to use NYS geocoding API.
BEGIN PROGRAM Python.
import urllib, json

def ParsNYGeo(jBlob):
  if not jBlob['candidates']:
    data = [None,None,None]
  else:
    add = jBlob['candidates'][0]['address']
    y = jBlob['candidates'][0]['location']['y']
    x = jBlob['candidates'][0]['location']['x']
    data = [add,x,y]
  return data

def NYSGeo(Add, WKID=4326):
  base = "http://gisservices.dhses.ny.gov/arcgis/rest/services/Locators/SAM_composite/GeocodeServer/findAddressCandidates?SingleLine="
  wkid = "&maxLocations=1&outSR=4326"
  end = "&f=pjson"
  mid = Add.replace(' ','+')
  MyUrl = base + mid + wkid + end
  soup = urllib.urlopen(MyUrl)
  jsonRaw = soup.read()
  jsonData = json.loads(jsonRaw)
  MyDat = ParsNYGeo(jsonData)
  return MyDat

t1 = "100 Washington Ave, Albany, NY"
t2 = "100 Washington Ave, Poop"

Out = NYSGeo(t1)
print Out

Empt = NYSGeo(t2)
print Empt
END PROGRAM.

So you can see in the code sample that you need both the street address and the city in one field. And here is a quick example with some data in SPSS. Just the zip code doesn’t return any results. There is some funny results here though in this test run, and yes that Washington Ave. extension has caused me geocoding headaches in the past.

*Example using with SPSS data.
DATA LIST FREE / MyAdd (A100).
BEGIN DATA
"100 Washington Ave, Albany"
"100 Washinton Ave, Albany"
"100 Washington Ave, Albany, NY 12203"
"100 Washington Ave, Albany, NY, 12203"
"100 Washington Ave, Albany, NY 12206"
"100 Washington Ave, Poop"
"12222"
END DATA.
DATASET NAME NY_Add.

SPSSINC TRANS RESULT=GeoAdd lon lat TYPE=100 0 0 
  /FORMULA NYSGeo(Add=MyAdd).

LIST ALL.