Calendar Heatmap in SPSS

Here is just a quick example of making calendar heatmaps in SPSS. My motivation comes from similar examples of calendar heatmaps in R and SAS (I’m sure others exist as well). Below is an example taken from this Revo R blog post.

The code involves a macro that takes a date variable, calculates the row position the date falls in within the calendar heatmap (rowM), and also returns variables for the month and the day of the week, which are used in the subsequent plot. It is brief enough that I can post it here in its entirety.
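
As a quick check of the row arithmetic, consider January 16th, 2013: the month begins on a Tuesday, so #rowC = 3, and #mDay = 16, which gives rowM = TRUNC((16 + 3 - 2)/7) + 1 = TRUNC(17/7) + 1 = 3, placing the 16th in the third week row of the January panel.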


*************************************************************************************.
*Example heatmap.

DEFINE !heatmap (!POSITIONAL !TOKENS(1)).
compute month = XDATE.MONTH(!1).
value labels month
1 'Jan.'
2 'Feb.'
3 'Mar.'
4 'Apr.'
5 'May'
6 'Jun.'
7 'Jul.'
8 'Aug.'
9 'Sep.'
10 'Oct.'
11 'Nov.'
12 'Dec.'.
compute weekday = XDATE.WKDAY(!1).
value labels weekday
1 'Sunday'
2 'Monday'
3 'Tuesday'
4 'Wednesday'
5 'Thursday'
6 'Friday'
7 'Saturday'.
*Figure out beginning day of month.
compute #year = XDATE.YEAR(!1).
compute #rowC = XDATE.WKDAY(DATE.MDY(month,1,#year)).
compute #mDay = XDATE.MDAY(!1).
*Now ID which row for the calendar heatmap it belongs to.
compute rowM = TRUNC((#mDay + #rowC - 2)/7) + 1.
value labels rowM
1 'Row 1'
2 'Row 2'
3 'Row 3'
4 'Row 4'
5 'Row 5'
6 'Row 6'.
formats rowM weekday (F1.0).
formats month (F2.0).
*now you just need to make the GPL call!.
!ENDDEFINE.

set seed 15.
input program.
loop #i = 1 to 365.
    compute day = DATE.YRDAY(2013,#i).
    compute flag = RV.BERNOULLI(0.1).
    end case.
end loop.
end file.
end input program.
dataset name days.
format day (ADATE10).
exe.

!heatmap day.
exe.
temporary.
select if flag = 1.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=weekday rowM month
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: weekday=col(source(s), name("weekday"), unit.category())
 DATA: rowM=col(source(s), name("rowM"), unit.category())
 DATA: month=col(source(s), name("month"), unit.category())
 COORD: rect(dim(1,2),wrap())
 GUIDE: axis(dim(1))
 GUIDE: axis(dim(2), null())
 GUIDE: axis(dim(4), opposite())
 SCALE: cat(dim(1), include("1.00", "2.00", "3.00", "4.00", "5.00","6.00", "7.00"))
 SCALE: cat(dim(2), reverse(), include("1.00", "2.00", "3.00", "4.00", "5.00","6.00"))
 SCALE: cat(dim(4), include("1.00", "2.00", "3.00", "4.00", "5.00",
  "6.00", "7.00", "8.00", "9.00", "10.00", "11.00", "12.00"))
 ELEMENT: polygon(position(weekday*rowM*1*month), color.interior(color.red))
END GPL.
*************************************************************************************.

This produces the image below. If you omit the temporary and select if commands, you can see what the plot looks like with the entire year filled in.

This is nice for illustrating potential day of week patterns for specific events that only rarely occur, but you can map any aesthetic you please to the color of the polygon (or you can change the size of the polygons if you like). Below is an example where I used this recently to demonstrate what days a spree of crimes appeared on, and I categorically colored certain dates to indicate that multiple crimes occurred on those dates. It is easy to see from the plot that there isn’t a really strong tendency toward any particular day of week, but there is some evidence of spurts of higher activity.

In terms of GPL logic I won’t go into too much detail, but the plot works even with months or rows missing in the data because of the finite number of potential months and rows in the plot (see the SCALE statements with the explicit categories included). If you need to plot multiple years, you either need separate plots or another facet. Most of the examples show numerical information for every day, which makes it difficult to really see patterns, but the technique shouldn’t be entirely discarded just because of that (I would have to simultaneously disregard every choropleth map ever made if I did that!).

Spineplots in SPSS

So the other day someone on CrossValidated asked about visualizing categorical data, and spineplots were one of the responses. The OP asked if a solution in SPSS was available, and there is none currently, with the exception of calling R code for mosaic plots, for which there is a handy function on developerWorks. I had some code I had started in an attempt to make them, and it is good enough to showcase. Some notes on terminology: these go by various other names (including Marimekko and mosaic plots); also see this developerWorks thread, which says spineplot but shows something a bit different. Cox (2008) has a good discussion of the naming issues as well as examples, and Wickham & Hofmann (2011) have some more general discussion about different types of categorical plots and their relationships.

So instead of utilizing a regular stacked bar chart, spineplots make the width of the bar proportional to the size of the category. This makes categories with a larger share of the sample appear larger. Below is an example image from a recent thread on CV discussing various ways to plot categorical data.

What this represents should be fairly intuitive. It is just a stacked bar chart where the width of the bars on the X axis represents the marginal proportion of that category, and the height of the boxes on the Y axis represents the conditional proportion within each category (hence, all bars sum to a height of one).
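
For a concrete (made up) example: if 60% of the sample falls in column category A, then A’s bar spans 0 to 0.60 on the X axis; if 30% of the cases within A fall in the first row category, that box spans 0 to 0.30 on the Y axis within A’s bar, so its area (0.60 × 0.30 = 0.18) equals the joint proportion of that cell.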

Located here I have some of my example code to produce a similar plot all natively within SPSS. Directly below is an image of the result, and below that is an example of the syntax needed to generate the chart. In a nutshell, I provide a macro to generate the coordinates of the boxes and the labels. Then I just provide an example of how to generate the chart in GPL. The code currently sorts the boxes by the marginal totals on each axis, with the largest categories in the lower stack and to the left-most area of the chart. There is an optional parameter to turn this off though, in which case the sorting will be just by ascending order of however the categories are coded (the code has an example of this). I also provide an example at the end calling the R code to produce similar plots (but not shown here).

Caveats should be mentioned here as well: the code currently only works for two categorical variables, and the categories on the X axis are labeled via data points within the chart. This will produce bad results with labels that are very close to one another (but at least you can edit/move them post-hoc in the chart editor in SPSS).

I asked Nick Cox on this question if his spineplot package for Stata had any sorting, and he replied in the negative. He has likely thought about it more than me, but I presume they should be sorted somehow by default, and sorting by the marginal totals in the categories was pretty easy to accomplish. I would like to dig into this (and other categorical data visualizations) a bit more, but unfortunately time is limited (and these don’t have much direct connection to my current scholarly work). There is a nice hodge-podge collection at the current question on CV I mentioned earlier (I think I need to add in a response about ParSets at the moment as well).



********************************************************************.
*Plots to make Mosaic Macro, tested on V20.
*I know for a fact V15 does not work, as it does not handle 
*coloring the boxes correctly when using the link.hull function.

*Change this to wherever you save the MosaicPlot macro.
FILE HANDLE data /name = "E:\Temp\MosaicPlot".
INSERT FILE = "data\MacroMosaic.sps".

*Making random categorical data.
dataset close ALL.
output close ALL.

set seed 14.
input program.
loop #i = 1 to 1000.
    compute DimStack = RV.BINOM(2,.6).
    compute DimCol = RV.BINOM(2,.7).
    end case.
end loop.
end file.
end input program.
dataset name cats.
exe.

value labels DimStack
0 'DimStack Cat0'
1 'DimStack Cat1'
2 'DimStack Cat2'.
value labels DimCol
0 'DimCol Cat0'
1 'DimCol Cat1'
2 'DimCol Cat2'.

*set mprint on.
!makespine Cat1 = DimStack Cat2 = DimCol.
*Example Graph - need to just replace Cat1 and Cat2 where appropriate.
dataset activate spinedata.
rename variables (DimStack = Cat1)(DimCol = Cat2).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X2 X1 Y1 Y2 myID Cat1 Cat2 Xmiddle
  MISSING = VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: Y2=col(source(s), name("Y2"))
 DATA: Y1=col(source(s), name("Y1"))
 DATA: X2=col(source(s), name("X2"))
 DATA: X1=col(source(s), name("X1"))
 DATA: Xmiddle=col(source(s), name("Xmiddle"))
 DATA: myID=col(source(s), name("myID"), unit.category())
 DATA: Cat1=col(source(s), name("Cat1"), unit.category())
 DATA: Cat2=col(source(s), name("Cat2"), unit.category())
 TRANS: y_temp = eval(1)
 SCALE: linear(dim(2), min(0), max(1.05))
 GUIDE: axis(dim(1), label("Prop. Cat 2"))
 GUIDE: axis(dim(2), label("Prop. Cat 1 within Cat 2"))
 ELEMENT: polygon(position(link.hull((X1 + X2)*(Y1 + Y2))), color.interior(Cat1), split(Cat2))
 ELEMENT: point(position(Xmiddle*y_temp), label(Cat2), transparency.exterior(transparency."1"))
END GPL.

*This makes the same chart without sorting.
dataset activate cats.
dataset close spinedata.
!makespine Cat1 = DimStack Cat2 = DimCol sort = N.
dataset activate spinedata.
rename variables (DimStack = Cat1)(DimCol = Cat2).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X2 X1 Y1 Y2 myID Cat1 Cat2 Xmiddle
  MISSING = VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: Y2=col(source(s), name("Y2"))
 DATA: Y1=col(source(s), name("Y1"))
 DATA: X2=col(source(s), name("X2"))
 DATA: X1=col(source(s), name("X1"))
 DATA: Xmiddle=col(source(s), name("Xmiddle"))
 DATA: myID=col(source(s), name("myID"), unit.category())
 DATA: Cat1=col(source(s), name("Cat1"), unit.category())
 DATA: Cat2=col(source(s), name("Cat2"), unit.category())
 TRANS: y_temp = eval(1)
 SCALE: linear(dim(2), min(0), max(1.05))
 GUIDE: axis(dim(1), label("Prop. Cat 2"))
 GUIDE: axis(dim(2), label("Prop. Cat 1 within Cat 2"))
 ELEMENT: polygon(position(link.hull((X1 + X2)*(Y1 + Y2))), color.interior(Cat1), split(Cat2))
 ELEMENT: point(position(Xmiddle*y_temp), label(Cat2), transparency.exterior(transparency."1"))
END GPL.
*In code online I have example using Mosaic plot plug in for R.
********************************************************************.

Citations of Interest

Cox, Nicholas J. 2008. Speaking Stata: Spineplots and their kin. The Stata Journal 8(1): 105-121.

Wickham, Hadley & Heike Hofmann. 2011. Product plots. IEEE Transactions on Visualization and Computer Graphics 17(12): 2223-2230.

Update for Aoristic Macro in SPSS

I’ve substantially updated the aoristic macro for SPSS from what I previously posted. The updated code can be found here. The improvements are;

  • Code is much more modularized; it is only one function, and it takes an Interval parameter to determine what interval summaries you want.
  • It includes Agresti-Coull binomial error intervals (95% confidence intervals). It also returns a percentage estimate and the total number of cases the estimate is based on, besides the usual info for time period, split file, and the absolute aoristic estimate (a sketch of the interval arithmetic follows this list).
  • It allows an optional command to save the reshaped long dataset.
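
For reference, the Agresti-Coull interval just adds z^2/2 pseudo-successes and z^2 pseudo-trials before computing a normal approximation interval. Below is a minimal sketch of that arithmetic in syntax; the variable names x and n (the successes and trials) are made up for illustration and are not the macro’s own.

***************************.
*Agresti-Coull 95% interval for x successes out of n trials (sketch of the arithmetic only).
compute #z = 1.96.
compute #nadj = n + #z**2.
compute #padj = (x + (#z**2)/2) / #nadj.
compute LowCI = #padj - #z*SQRT(#padj*(1 - #padj)/#nadj).
compute HighCI = #padj + #z*SQRT(#padj*(1 - #padj)/#nadj).
***************************.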

Functionality dropped includes the default plots and the saving of begin, end and middle times for the same estimates. I just didn’t find these useful (besides for academic purposes).

The main motivation was to add in error bars, as I found when I was making many of these charts it was obvious that some of the estimates were highly variable. While the Agresti-Coull binomial proportions are not entirely justified in this novel circumstance, they are better than nothing to at least illustrate the error in the estimates (it seems to me that they will likely be too small if anything, but I’m not sure).

I think a good paper I might work on in the future when I get a chance is to 1) show how variable the estimates are in small samples, and 2) evaluate the asymptotic coverage of various estimators (traditional binomial proportions vs. the bootstrap, I suppose). Below is an example output of the updated macro, again with the same data I used previously. I make the small multiple chart by different crime types to show the variability in the estimates for given sample sizes.

Interval graph for viz. temporal overlap in crime events

I’ve recently made a visualization intended as an exploratory tool to identify overlapping criminal events over a short period. Code to reproduce the macro is here, and that includes an example with made up data. As opposed to aoristic analysis, which lets you see globally the aggregated summaries of when crime events occurred, potentially correcting for the unspecified time period, this graphical procedure allows you to identify whether local events overlap. It also allows one to perceive global information, in particular whether the uncertainty of events occurs in the morning, afternoon, or night.

A call to the macro ends up looking like this (other_vars is optional and can include multiple variables – I assume the token names are self-explanatory);

!interval_data date_begin = datet_begin time_begin = XTIMEBEGIN date_end = datet_end time_end = XTIMEEND 
label_id = myid other_vars = crime rand.

This just produces a new dataset named interval_data, which can then be plotted. And here is the example graph that comes with the macro and its fake data (after I edited the chart slightly).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" dataset = "interval_data" VARIABLES= LABELID DAY TIMEBEGIN TIMEEND MID WITHINCAT DAYWEEK
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: LABELID=col(source(s), name("LABELID"), unit.category())
 DATA: DAY=col(source(s), name("DAY"), unit.category())
 DATA: TIMEBEGIN=col(source(s), name("TIMEBEGIN"))
 DATA: TIMEEND=col(source(s), name("TIMEEND"))
 DATA: MID=col(source(s),name("MID"))
 DATA: WITHINCAT=col(source(s),name("WITHINCAT"), unit.category())
 DATA: DAYWEEK=col(source(s),name("DAYWEEK"), unit.category())
 COORD: rect(dim(1,2), cluster(3,0))
 SCALE: cat(dim(3), values("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23",
                                         "24","25","26","27","28","29","30","31"))
 ELEMENT: interval(position(region.spread.range(WITHINCAT*(TIMEBEGIN + TIMEEND)*DAY)), color.interior(DAYWEEK),
                             transparency.interior(transparency."0.1"), transparency.exterior(transparency."1"))
 ELEMENT: point(position(WITHINCAT*MID*DAY), color.interior(DAYWEEK), label(LABELID))
END GPL.

The chart can be interpreted as follows: the days of the week are colored, and each interval represents one crime event, except when a crime event occurs overnight, in which case the bar is split over two days (and the event is labeled on each day). I wanted labels in the chart to easily reference specific events, and I assigned a point to the midpoint of the intervals to plot the labels (also to give some visual girth to events that occurred over a short interval – otherwise they would be invisible in the chart). To displace the bars horizontally within the same day the chart essentially uses the same type of procedure that occurs in clustered bar charts.

GPL code can be inserted directly within macros, but it is quite a pain. It is better to use Python to parameterize GGRAPH, but I’m too lazy and don’t have Python installed on my machine at work (the ancient version of SPSS, V15, is to blame).

Here is another example with more of my data in the wild. This is for thefts from motor vehicles in Troy from the beginning of the year until 2/22. We had a bit of a rash over that time period, but they have since died down after the arrest of one particularly prolific offender. This is evident in the chart.

We can also break the chart down by other categories. This is what the other_vars token is for; it carries forward these other variables for use in faceting. For example, Troy has 4 police zones, and here is the graph broken down by each of them. Obviously crime sprees within short time frames are more likely perpetrated in close proximity, and events committed by the same person are likely to re-occur within the same geographic proximity. The individual noted before was linked to events in Zone 4. I turned the labels off (it is pretty easy to toggle them in SPSS), and then one can focus on individual events close in time or on overlapping intervals pretty easily.

*Panelling by BEAT_DIST.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" dataset = "interval_data" VARIABLES= LABELID DAY TIMEBEGIN TIMEEND MID WITHINCAT DAYWEEK BEAT_DIST
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: LABELID=col(source(s), name("LABELID"), unit.category())
 DATA: DAY=col(source(s), name("DAY"), unit.category())
 DATA: TIMEBEGIN=col(source(s), name("TIMEBEGIN"))
 DATA: TIMEEND=col(source(s), name("TIMEEND"))
 DATA: MID=col(source(s),name("MID"))
 DATA: WITHINCAT=col(source(s),name("WITHINCAT"), unit.category())
 DATA: DAYWEEK=col(source(s),name("DAYWEEK"), unit.category())
 DATA: BEAT_DIST=col(source(s),name("BEAT_DIST"), unit.category())
 COORD: rect(dim(1,2), cluster(3,0))
 SCALE: cat(dim(3), values("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23",
                           "24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45",
                           "46"))
 ELEMENT: interval(position(region.spread.range(WITHINCAT*(TIMEBEGIN + TIMEEND)*DAY*1*BEAT_DIST)), color.interior(DAYWEEK),
                             transparency.interior(transparency."0.1"), transparency.exterior(transparency."1"))
 ELEMENT: point(position(WITHINCAT*MID*DAY*1*BEAT_DIST), color.interior(DAYWEEK), label(LABELID))
END GPL.

Note that to make the X axis include all of the days, I needed to include all of the numerical categories (between 1 and 46) in the values list in the SCALE statement.

The chart in its current form can potentially be improved in a few ways, but I’ve had trouble accomplishing them so far. One is that instead of utilizing clustering to displace the intervals, one could use dodging directly. I have yet to figure out how to specify the appropriate GPL code when using dodging instead of clustering. Another is to utilize the dates directly, instead of an arbitrary categorical counter since the beginning of the series. To use dodging or clustering the axis needs to be categorical, so you could just make the axis the date, but bin the dates and specify the width of the bin categories to be 1 day (this would also avoid the annoying values statement specifying all the days). Again though, I was not able to figure out the correct GPL code to accomplish this.

I’d like to investigate ways to make this interactive and link with maps as well. I suspect it is possible in D3.js, and if I get around to figuring out how to make such a map/graphic duo I will for sure post it on the blog. In the meantime any thoughts or comments are always appreciated.

Aoristic analysis with SPSS

I’ve written a macro to conduct aoristic analysis with SPSS. Here I will briefly describe what it is, provide alternative references and demonstrate some of its utility on example data from Arlington PD.

In short, crime event data are frequently recorded as occurring within some indefinite time frame. For example, you may park your car and go to work at 08:00, and when you come back out at 16:30 on the same day you find your car window broken and your GPS stolen. Unless there happen to be other witnesses to the crime, you don’t know when the criminal event occurred other than between those two times. This is problematic for crime analysis because you want to be able to look at the distribution of when events occurred, so as to give suggestions for understanding why the events are occurring and how to potentially address them. Allocating patrols geographically and temporally to areas of high crime incidence has been regular practice for a long time (Wilson, 1963)! Aoristic analysis is simply a means to take that uncertainty about when the event occurred into account when we examine the overall incidence of crimes occurring across a set of times.

For a very brief illustrative example, let’s say we want to know the number of crimes occurring within the hours of 08:00, 09:00 and 10:00. If we had a criminal event that potentially occurred between 08:00 and 10:00, a total time span of 120 minutes (2 hours), then instead of counting that event as occurring at 08:00 (the begin time), 10:00 (the end time) or 09:00 (the middle time), we spread the event out over the time frame and only partially count it within any particular interval. So here it would count with a weight of 0.50 in both the 08:00 and 09:00 categories (60/120 = 0.50) and a weight of 0 in the 10:00 category (note the weights sum back to a value of 1). This just ends up being a way to estimate the incidence of some event within a given time bin knowing that the event did not necessarily occur in that time bin, so it only partially counts towards the total in that bin (where partially is defined by how long the interval is and how much of that interval overlaps with the bin).
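
To make that concrete in syntax, here is a toy sketch of the weighting logic for a single 08:00 to 10:00 event spread over 1 hour bins. This is just an illustration with made up variable names, not the macro itself.

***************************.
*Toy sketch of the aoristic weighting for a single event and 1 hour bins.
data list free / id.
begin data
1
end data.
dataset name toy.
*SPSS stores times as seconds, so the event runs from 28800 to 36000 internally.
compute BeginT = TIME.HMS(8,0,0).
compute EndT = TIME.HMS(10,0,0).
vector W(24).
loop #h = 1 to 24.
    compute #binlo = (#h - 1)*3600.
    compute #binhi = #h*3600.
    compute W(#h) = MAX(0, MIN(EndT,#binhi) - MAX(BeginT,#binlo)) / (EndT - BeginT).
end loop.
exe.
*Each weight is the overlap of the event with that bin as a share of the total event length.
*W9 and W10 (the 08:00 and 09:00 bins) each equal 0.50, every other bin is 0.
***************************.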

Here I illustrate the macro with some examples from the Arlington PD data downloaded on 1/11/2013. This is the only dataset I’ve found publicly available online that has both start and end dates for events (I first looked at NIBRS, and was slightly surprised that they did not have this information). I have the macro code, with examples therein of fake data and the same Arlington PD data, and compare them to this online calculator.

Note, my results will be slightly different from most other programs (including the online app I pointed to) because of one arbitrary (but what I feel is reasonable) coding decision. When an event spans more than the time period evaluated in the particular estimate (either a day or a week for my functions), the macro simply returns the event coded as having equal weight across the entire time period. Others don’t do this as far as I’m aware. So say an event takes place between 08:00 on 1/2/2013 and 10:00 on 1/3/2013. For my functions that only evaluate times over the day, I would return equal probability within every time slot, although some calculators would say there is a higher probability of the event occurring between 08:00 and 10:00 (because of the wrap-around). I believe this practice is a bit of a stretch; once the uncertainty spans a full day, the recorded times are essentially useless for determining when during the day the event occurred (although my week functions would be equivalent in that example). In those cases the begin and end times say more about when people check their cars, wake up in the morning, get home from work, come back from vacation, etc. than they do about when the actual crime occurred.

Some examples

If you want to follow along right within SPSS, I suggest going to the Google Code site where I’ve posted the code and data, but otherwise you can just take my word for it and see how the macro works in practice. I provide several separate functions to estimate the frequency of crimes occurring during 1 hour bins over the day, 15 minute bins over the day, days over the week, 1 hour bins over the week, or 15 minute bins over the week.

Here is an example call of the macro and the output from the 1 hour bins over the day with all crimes for the Arlington data.

!aoristic_day1hour begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2.

Date1 and Date2 are the begin and end dates respectively, and Time1 and Time2 are the begin and end times respectively. Below is (close to) the automated graph the macro produces, which is just a line chart superimposing both the aoristic estimate and what the estimate would be if using the begin, end or middle time. The only differences are that I post-hoc edited the aoristic estimate line to be thicker and in front (so it is more prominent), plus my personal chart template. Parameterizing GGRAPH charts to work in macros is quite annoying, and Python is a preferable solution (I’m personally happy with just a helper function to return the data in a nice format for me to generate the appropriate GGRAPH code myself; there is more power in knowing the grammar than being complacent with the default chart).

 

 

So you can see here that overall, the aoristic estimate does not make much of a difference. Here is the same info for the 15 minute bins across the day.

!aoristic_day15min begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2.
 

 

You can see here the aoristic estimate smooths out the data quite a bit more (which is nice above and beyond just worrying about whether one approach is correct or not). With the smaller time bins you can also see the tendency to over-report incidents at natural hour and half-hour marks. You can also see that midnight, noon and 08:00 are abnormally popular times to report either beginning or end times of incidents. You can spot a few others as well that differ between begin and end times; for instance it appears 07:00 is a popular end time but not so popular a begin time. The obverse is true for events in the middle afternoon, late evening and early night (e.g. the big spikes in the green begin time line for hours between 17:00 and midnight). Also note that it is near universal that crime dips to its lowest around 4~5 am, and you can see that using either the aoristic estimate or the middle point of the event brings the number of events up during this period (as expected).

Also in the set of functions I have the capability to specify an arbitrary category to split the results by, and here is an example splitting the day of week aoristic estimate by the beat variable provided with the Arlington data. Again this isn’t the direct output of the macro, but a subsequent GGRAPH command to produce a nicer chart (the original is ok if you make it much bigger, but with so many categories facet wrapping is appropriate to save space).

!aoristic_week begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2 split = Beat.
 

 

The main thing that draws attention in this graph is the difference in the levels of calls and the different trends between beats (there were no obvious differences between the aoristic estimates and the naive estimates, which is unsurprising since most incidents don’t have uncertainty of over a day). I know nothing about Arlington, and I don’t know where these beats are, so I can’t say anything about why these differences may occur. In SPSS days of the week start at Sunday (so Sunday = 1 and Saturday = 7). It isn’t that strange to expect slightly more crime on Fridays and Saturdays (people out and about doing things that make them more vulnerable), but for most of the beats showing a flat profile this is not strange either. It is interesting, though, to see that Beat 260 has an atypical pattern of obviously more crimes during the week, and if I had to hazard a guess I would assume there is a middle or high school in Beat 260.

Although you could argue aoristic analysis is called for based purely on theoretical grounds, for all of these examples it doesn’t make much of a difference whether you use it or simpler methods. Where it is likely to make the most difference are events with the longest unknown time intervals. Property crimes tend to be committed when the victim is not around, so here I compare the aoristic estimate for 15 minute intervals over the day for burglaries, which shows an example where the aoristic estimate makes an actual difference!

 

 

One can see the property crimes have a larger difference for the aoristic estimate across the day, and it is largely flat compared to the begin and end times. The end and begin times are likely biased towards when people discovered that the victimization occurred or when they last left their home vulnerable. There is some slight trend for more burglaries to occur during the daytime, with somewhat higher periods during the night (and lulls around 08:00 and 18:00). These are near the exact opposite conclusions you would make utilizing the begin and end times as to when most burglaries occurred! Middle times produce some weird differences as well, with a high spike in the 01:00 to 04:00 range.

Some closing notes

This project kicked my butt a little, and took much longer than I expected. Certainly the code could use improvement and re-factoring, but I’m glad it is done (and seemingly working). You will see there is certainly a lot of redundancy between the functions, some temporary variables are computed multiple times, and the week long functions take a while to compute.

In the SPSS macro, what I do in a nutshell is make a variable for every time bin, calculate the weight each case has for that time bin, reshape the dataset from wide to long, and then at last aggregate the total weight within each time bin. Note this results in many more cases than the original data. For example, the Arlington data used in this post has slightly over 49,000 incidents. For the 15 minute intervals per day (96 bins), this results in over 4.7 million cases (49,000*96). For the 1 hour bins across the week this results in n times 168 more cases, and for the 15 minute bins across the week, n times 672! Subsequently those latter two take an appreciable amount of time to compute for larger datasets (if you don’t run out of local memory on your computer entirely, which I’m guessing could easily happen on some older systems when you have upwards of probably 60,000 cases).
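
In syntax the core of that reshape-then-aggregate step looks roughly like the following. This is only a bare sketch reusing the made up W1 to W24 weight variables from the toy example earlier, not the macro code itself.

***************************.
*Sketch only: reshape the per bin weights long, then sum them within each bin.
VARSTOCASES
  /MAKE Weight FROM W1 TO W24
  /INDEX = Bin.
AGGREGATE OUTFILE = * MODE = REPLACE
  /BREAK = Bin
  /Aoristic = SUM(Weight).
***************************.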

For those interested, the bottleneck is obviously the VARSTOCASES procedure. But I have a substantive reason for going through that step, and that is if one wants to use the original data weighted, for say kernel density maps sliced by time of day, having the data in that format (long, with a field identifying the factor) is more convenient than the wide format. Thinking about it, I could generate NULL data for the 0 weight categories and then drop those cases during the VARSTOCASES, but it remains to be seen if that will have much of an appreciable effect on real world datasets. Hopefully in the near future I will get the time to provide examples of that (probably in R using faceting and small multiple maps). If anyone has improvements to the code feel free to send them to me (or just shoot me an email).
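
That null-and-drop idea would amount to something like the sketch below (again against the made up W1 to W24 variables, and not something I have tested in the macro): recode the zero weights to system-missing, and VARSTOCASES with /NULL = DROP should then not create rows for them.

***************************.
*Sketch only: drop the zero weight bins during the restructure.
RECODE W1 TO W24 (0 = SYSMIS).
VARSTOCASES
  /MAKE Weight FROM W1 TO W24
  /INDEX = Bin
  /NULL = DROP.
***************************.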

In the future I plan on talking about some more visualization techniques to explore crime data with intervals like this. In particular I have a plot manipulating the grammar of graphics a bit to produce a visualization of individual incidents, but it still needs some work and writing up into a nice function. Here is an example though.

 

 


Citations

Wilson, O.W. 1963. Police Administration. McGraw-Hill.

Ratcliffe, Jerry H. 2002. Aoristic signatures and the spatio-temporal analysis of high volume crime patterns. Journal of Quantitative Criminology 18(1): 23-43. PDF Here.

A quick SPSS tip: Using vertical selection in Notepad++ to edit printed MACRO statements

The current version of the SPSS syntax editor is really nice and I use it for most of my daily analysis. Sometimes though I utilize the text editor Notepad++ for various tasks that are difficult to accomplish in the SPSS editor. Here I will highlight one instance where I have found Notepad++ to be really helpful: editing printed MACRO statements by using vertical selection.

To start off with a brief example, I have created a very simple MACRO that has an obvious error in it.

**************************************************.
data list free / V1 (F2.0) V2 (F2.0) V3 (A4).
begin data
1 2 aaaa
3 4 bbbb
5 6 cccc
end data.
dataset name input.

DEFINE !example ().
compute X = V1 + V3.
!ENDDEFINE.

set mprint on.

!example.
**************************************************.

When expanded, the printed statement in the output viewer appears like this;

  56  0 M>   
  57  0 M>  . 
  58  0 M>  compute X = V1 + V3 
  59  0 M>  .

Now this is a trivial problem to fix, but what if you have hundreds of lines of code and want to edit out all of the beginning text before the commands (e.g. the 59 0 M> part)? It is useful to debug the expanded code because when debugging you can step through the expanded code but not the MACRO code. Editing out the initial lines in Notepad++ is not very hard though, because of the ability to utilize vertical selection. If you copy and paste the expanded macro statements into Notepad++, then press Alt and Shift simultaneously (this is for Windows, I’m not sure about other operating systems), you can vertically select the first 13 columns of text and delete them in one swoop. See the picture below to see what I am talking about with vertical selection.

I’ve found having another text editor at my disposal is useful for other tasks as well, so it is something to keep in mind when doing a lot of text editing in SPSS anyway. For instance, any time I need to find and replace I have a much better experience doing it in Notepad++ (and SPSS doesn’t have wildcard find/replace, which is obviously helpful in many situations). SPSS syntax files, .sps, are plain text, so you can actually just edit those files directly in any text editor you want as well.
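
Related to the example above, the printed macro prefix can also be stripped with a find and replace: with Notepad++’s search mode set to Regular expression, replacing a pattern along the lines of ^\s*\d+\s+\d+\s+M>\s* with nothing removes the prefix from every line in one pass (just a suggested pattern on my part; adjust it to however your output is spaced).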

Using SPSS as a calculator: Printing immediate calculations

I find it useful sometimes to do immediate calculations when I am in an interactive data analysis session. In either the R or Stata statistical programs, this is as simple as evaluating a valid expression. For example, typing 8762 - 4653 into the R console will return the result of the expression, 4109. SPSS does not come with this functionality out of the box, but I have attempted to replicate it utilizing the PRINT command with temporary variables, and wrapped it up in a MACRO for easier use.

The PRINT command can be used to print plain text output, and takes variables in the active dataset as input. For instance, if you have a dataset that consists of the following values;

***************************.
data list free / V1 (F2.0) V2 (F2.0) V3 (A4).
begin data
1 2 aaaa
3 4 bbbb
5 6 cccc
end data.
dataset name input.
dataset activate input.
***************************.

If you run the syntax command

***************************.
PRINT /V1.
exe. 
***************************.

The resulting text output (in the output window) will be (Note that for the PRINT command to route text to the output, it needs to be executed);

1
3
5

Now, to make my immediate expression calculator emulate R or Stata, I do not want to print out all of the cases in the active dataset (as the expression will be a constant, that is neither necessary nor wanted). So I limit the number of cases on the PRINT command by using a DO IF with the criteria $casenum = 1 ($casenum is an SPSS system variable referring to the row number in the dataset). One can then calculate a temporary scratch variable (indicated by a # prefix on the variable name) to hold the particular expression to be printed. The below example evaluates 9**4 (nine to the fourth power);

***************************.
DO IF $casenum = 1.
compute #temp = 9**4.
PRINT /#temp.
END IF.
exe.
***************************.

Now we have the ability to pass an expression and have the constant value returned (as long as it would be a valid expression on the right hand side of a compute statement). To make this a little more automated, one can write a macro that evaluates the expression.

***************************.
DEFINE !calc (!POSITIONAL !CMDEND).
DO IF $casenum = 1.
compute #temp = !1.
PRINT /#temp.
END IF.
exe.
!ENDDEFINE.

!calc 11**5.
***************************.

And now we have our script that takes an expression and returns the answer. This isn’t great when the number of cases is humongous, as it still appears to cycle through all of the records in the dataset, but for most realistically sized datasets the calculation will be instantaneous. For a test on 10 million cases, the result was returned in approximately two seconds on my current computer, but the execution of the command took another few seconds to cycle through the dataset.

Another problem I could see happening is that you cannot directly control the precision with which the value is returned. It appears the temporary variable is returned in whatever the current default variable format is. Below is an example in syntax changing the default to return 5 decimal places.

***************************.
SET Format=F8.5.
!calc 3/10.
***************************.

Also as a note, you will need to have an active dataset with at least one case within it for this to work. Let me know in the comments if I’m crazy and there is an obviously easier way to do this.