Estimating group based trajectory models using SPSS and R

For a project I have been estimating group based trajectory models for counts of crime at micro places. Synonymous with the trajectory models David Weisburd and colleagues estimated for street segments in Seattle. Here I will show how using SPSS and the R package crimCV one can estimate similar group based trajectory models. Here is the crimCV package help and here is a working paper by the authors on the methodology. The package specifically estimates a zero inflated poisson model with the options to make the 0-1 part and/or the count part have quadratic or cubic terms – and of course allows you specify the number of mixture groups to search for.

So first lets make a small set of fake data to work with. I will make 100 observations with 5 time points. The trajectories are three separate groups (with no zero inflation).

*Make Fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 100.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME OrigData.
*Make 3 fake trajectory profiles.
VECTOR Count(5,F3.0).
DO REPEAT Count = Count1 TO Count5 /#iter = 1 TO 5.
COMPUTE #i = #iter - 3.
COMPUTE #ii  = #i**2.
COMPUTE #iii = #i**3.
  DO IF Id <= 30.
    COMPUTE #P    = 10 + #i*0.3 + #ii*-0.1 + #iii*0.05.
    COMPUTE Count = RV.POISSON(#P).
  ELSE IF Id <=60.
    COMPUTE #P    =  5 + #i*-0.8 + #ii*0.3 + #iii*0.05.
    COMPUTE Count = RV.POISSON(#P).
  ELSE. 
    COMPUTE #P    =  4 + #i*0.8 + #ii*-0.5 + #iii*0.
    COMPUTE Count = RV.POISSON(#P).
  END IF.
END REPEAT.
FORMATS Id Count1 TO Count5 (F3.0).
EXECUTE.

Note The crimCV package wants the data to be wide format for the models, that is each separate time point in its own column. Now we can call R code using BEGIN PROGRAM R to recreate the wide SPSS dataset in an R data frame.

*Recreate SPSS data in R data frame.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS(variables=c("Id","Count1","Count2",
                                                "Count3","Count4","Count5")) #grab data from SPSS
casedataMat <- as.matrix(casedata[,2:6]) #turn into matrix
#summary(casedata) #check contents
#casedataMat[1:10,1:5]
END PROGRAM.

Now to fit one model with 3 groups (without calculating the cross validation statistics) the code would be as simple as:

*Example estimating model with 3 groups and no CV.
BEGIN PROGRAM R.
library(crimCV)
crimCV(casedataMat,3,rcv=FALSE,dpolyp=3,dpolyl=3)
END PROGRAM.

But when we are estimating these group based trajectory models we don’t know the number of groups in advance. So typically one progressively fits more groups and then uses model selection criteria to pick the mixture solution that best fits the data. Here is a loop I created to successively estimate models with more groups and stuffs the models results in a list. It also makes a separate data frame that saves the model fit statistics, so you can see which solution fits the best (at least based on these statistics). Here I loop through estimates of 1 through 4 groups (this takes about 2 minutes in this example). Be warned – here are some bad programming practices in R (the for loops are defensible, but growing the lists within the loop is not – they are small though in my application and I am lazy).

*looping through models 1 through 4.
BEGIN PROGRAM R.
results <- c()  #initializing a set of empty lists to store the seperate models
measures <- data.frame(cbind(groups=c(),llike=c(),AIC=c(),BIC=c(),CVE=c())) #nicer dataframe to check out model 
                                                                            #model selection diagnostics
max <- 4 #this is the number of grouping solutions to check

#looping through models
for (i in 1:max){
    model <- crimCV(casedataMat,i,rcv=TRUE,dpolyp=3,dpolyl=3)
    results <- c(results, list(model))
    measures <- rbind(measures,data.frame(cbind(groups=i,llike=model$llike,
                                          AIC=model$AIC,BIC=model$BIC,CVE=model$cv)))
    #save(measures,results,file=paste0("Traj",as.character(i),".RData")) #save result
    }
#table of the model results
measures
END PROGRAM.

In my actual application the groups take a long time to estimate, so I have the commented line saving the resulting list in a file. Also if the model does not converge it breaks the loop. So here we see that the mixture with 3 groups is the best fit according to the CVE error, but the 2 group solution would be chosen by AIC or BIC criteria. Just for this tutorial I will stick with the 3 group solution. We can plot the predicted trajectories right within R by selecting the nested list.

*plot best fitting model.
BEGIN PROGRAM R.
plot(results[[3]])
#getAnywhere(plot.dmZIPt) #this is the underlying code
END PROGRAM.

Now the particular object that stores the probabilities is within the gwt attribute, so we can transform this to a data frame, merge in the unique identifier, and then use the STATS GET R command to grab the resulting R data frame back into an SPSS dataset.

*Grab probabiltiies back SPSS dataset.
BEGIN PROGRAM R.
myModel <- results[[3]] #grab model
myDF <- data.frame(myModel$gwt) #probabilites into dataframe
myDF$Id <- casedata$Id #add in Id
END PROGRAM.
*back into SPSS dataset.
STATS GET R FILE=* /GET DATAFRAME=myDF DATASET=TrajProb.

Then we can merge this R data frame into SPSS. After that, we can classify the observations into groups based on the maximum posterior probability of belonging to a particular group.

*Merge into original dataset, and the assign groups.
DATASET ACTIVATE OrigData.
MATCH FILES FILE = *
  /FILE = 'TrajProb'
  /BY ID
  /DROP row.names.
DATASET CLOSE TrajProb.
*Assign group based on highest prob.
*If tied last group wins.
VECTOR X = X1 TO X3.
COMPUTE #MaxProb = MAX(X1 TO X3).
LOOP #i = 1 TO 3.
  IF X(#i) = #MaxProb Group = #i.
END LOOP.
FORMATS Group (F1.0).

Part of the motivation for doing this is not only to replicate some of the work of Weisburd and company, but that it has utility for identifying long term hot spots. Part of what Weisburd (and similarly I am) finding is that crime at small places is pretty stable over long periods of time. So we don’t need to make up to date hotspots to allocate police resources, but are probably better off looking at crime over much longer periods to identify places for targeted strategies. Trajectory models are a great tool to help identify those long term high crime places, same as geographic clustering is a great tool to help identify crime hot spots.

15 Comments

by Andy Wheeler on August 12, 2014 • Permalink

Posted in R, SPSS

Tagged clustering, Crime Analysis, data-manipulation, r, SPSS, time-series

Posted by Andy Wheeler on August 12, 2014

https://andrewpwheeler.com/2014/08/12/estimating-group-based-trajectory-models-using-spss-and-r/

15 Comments

Jon Peck
/ August 13, 2014

One thing that caught my eye in this is the use of STATS GET R to create the new data generated by the R code as a Statistics dataset. Since there are already apis in the R plug-in to do that, it never occurred to me when I wrote GET R that using it would sometimes be an easier way to do this. The apis are a little difficult to figure out, and using GET R really simplifies it.

Reply
- apwheele
  / August 13, 2014
  
  Yes the STATS GET R command is much simpler than doing all of the variable specifications to create the dictionary yourself (thank you!). I remember in one case in some of my older code I simply exported to csv and then read the csv in SPSS when I got frustrated troubleshooting code to set the dictionary!
  
  Reply
Gcirco
/ September 24, 2015

I’d like to thank you for introducing me to this CrimCV program. I’ve been searching for a program to run trajectory analyses in R (I’d rather not use Proc Traj in SAS or STATA). Your code for looping the results is especially helpful.

Do you know how to create custom plots using output from the crimCV results? I’d like to pass this data into ggplot for better visualization, but for the life of me, I can’t seem to find out what data is used to create plots.

Reply
- apwheele
  / September 24, 2015
  
  I know what you mean, as it is not well documented. I have in the code a commented out example of looking at the plotting code to get the trajectories.
  
  #getAnywhere(plot.dmZIPt) #this is the underlying code
  
  I will see if I can make some time to put together code to plot the trajectories using ggplot2. There are a few different ways to do it, and that code is the expected trajectories from the model. If you see my paper, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2585987, I do different plots of the individual trajectories as well.
  
  Reply
  - Gcirco
    / September 24, 2015
    
    Interesting! Not as easy to do as I thought, utilizing matplot. Well, I’ll continue to plug away. Thanks very much for your reply. I’ve enjoyed your blog, as I’ve been working on similar geospatial problems analyzing crime in Detroit.
Jason
/ September 27, 2018

Greetings Andrew. I appreciate this post and your follow-up with the plots. In SAS Proc Traj, you can assign separate ordered polynomials for each of the groups — my attempts to do so in crimCV have failed. Do you know if this is possible? Thanks!

Reply
- apwheele
  / September 27, 2018
  
  I don’t know what you mean exactly.
  
  Reply
  - Jason
    / September 27, 2018
    
    Sorry if I was unclear. Let’s say you are testing a model with 3 groups, in SAS Proc Traj, you can indicate the order of polynomial trends to test for each of the groups. For instance, you can specify a quadratic for group 1, linear for group 2, and so on. It seems with crimCV, you can only choose one trend that is applied to each of the groups.
  - apwheele
    / September 28, 2018
    
    You can’t do that in the crimCV package unfortunately. You can do it in the Proc Traj package for Stata though FYI.
Regina
/ June 14, 2022

Many thanks for your introduction to the crimCV package and the code for plotting graphs! I was wondering if there is an option to incorporate covariates into the model when using the crimCV package? Looking a the RDocumentation of crimCV it doesn’t seem to have this option? If not, are you aware of a different R package for estimating trajectories (either growth mixture modelling or group-based trajectory modelling for counts) that allows for an addition of covariates please?

Reply
- apwheele
  / June 14, 2022
  
  I have been using the FlexMix library more recently, which has the ability to incorporate time varying and time static covariates, see https://andrewpwheeler.com/2022/04/19/state-dependence-and-trajectory-models/ for an example.
  
  Reply
  - Regina
    / June 14, 2022
    
    Many thanks for your quick reply. I’ll explore the option of using the FlexMix package.

Andrew Wheeler

Estimating group based trajectory models using SPSS and R

15 Comments

Jon Peck

apwheele

Gcirco

apwheele

Gcirco

Jason

apwheele

Jason

apwheele

Regina

apwheele

Regina

Leave a comment Cancel reply

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange

Estimating group based trajectory models using SPSS and R

Share this:

Related

15 Comments

Jon Peck

Jason

Regina

Regina

Leave a comment Cancel reply

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange