Spineplots in SPSS

So the other day someone on Cross Validated asked about visualizing categorical data, and spineplots were one of the responses. The OP asked if a solution in SPSS was available, and there is currently none, with the exception of calling R code for mosaic plots, for which there is a handy function on developerWorks. I had some code I had started in an attempt to make them, and it is good enough to showcase. A note on notation: these go by various other names (including Marimekko and Mosaic), and see also this developerWorks thread which says spineplot but shows something a bit different. Cox (2008) has a good discussion of the naming issues as well as examples, and Wickham & Hofmann (2011) have a more general discussion of different types of categorical plots and their relationships.

Instead of using a regular stacked bar chart, spineplots make the width of each bar proportional to the size of its category, so categories with a larger share of the sample appear larger. Below is an example image from a recent thread on CV discussing various ways to plot categorical data.

What it represents should be fairly intuitive. It is just a stacked bar chart where the width of each bar on the X axis represents the marginal proportion of that category, and the height of each box on the Y axis represents the conditional proportion within that category (hence all bars sum to a height of one).
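
To make the geometry concrete, here is a toy sketch of the box corners such a chart is built from (the numbers are made up for illustration, not taken from the macro or the example data): a 2 x 2 table with column totals of 700 and 300 out of 1,000 cases gives bar widths of 0.7 and 0.3, and conditional row proportions of 0.4/0.6 in the first column and 0.6/0.4 in the second give the box heights;

*Toy corner coordinates for a hypothetical 2 x 2 table (numbers made up for illustration).
data list free / X1 X2 Y1 Y2.
begin data
0.0 0.7 0.0 0.4
0.0 0.7 0.4 1.0
0.7 1.0 0.0 0.6
0.7 1.0 0.6 1.0
end data.
dataset name toyspine.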

Located here is some of my example code to produce a similar plot entirely natively within SPSS. Directly below is an image of the result, and below that is an example of the syntax needed to generate the chart. In a nutshell, I provide a macro to generate the coordinates of the boxes and the labels, then an example of how to generate the chart in GPL. The code currently sorts the boxes by the marginal totals on each axis, with the largest categories in the lower stack and in the left-most area of the chart. There is an optional parameter to turn this off, in which case the sorting will simply be in ascending order of however the categories are coded (the code has an example of this). At the end I also provide an example calling the R code to produce similar plots (not shown here).

Some caveats should be mentioned as well: the code currently only works for two categorical variables, and the categories on the X axis are labelled via data points within the chart. This will produce bad results when labels are very close to one another (but at least you can edit/move them post hoc in the chart editor in SPSS).

I asked Nick Cox in this question whether his spineplot package for Stata had any sorting, and he replied in the negative. He has likely thought about it more than I have, but I presume the categories should be sorted somehow by default, and sorting by the marginal totals was pretty easy to accomplish. I would like to dig into this (and other categorical data visualizations) a bit more, but unfortunately time is limited (and these don't have much direct connection to my current scholarly work). There is a nice hodge-podge collection at the CV question I mentioned earlier (I think I need to add a response about ParSets as well).



********************************************************************.
*Plots to make Mosaic Macro, tested on V20.
*I know for a fact V15 does not work, as it does not handle 
*coloring the boxes correctly when using the link.hull function.

*Change this to wherever you save the MosaicPlot macro.
FILE HANDLE data /name = "E:\Temp\MosaicPlot".
INSERT FILE = "data\MacroMosaic.sps".

*Making random categorical data.
dataset close ALL.
output close ALL.

set seed 14.
input program.
loop #i = 1 to 1000.
    compute DimStack = RV.BINOM(2,.6).
    compute DimCol = RV.BINOM(2,.7).
    end case.
end loop.
end file.
end input program.
dataset name cats.
exe.

value labels DimStack
0 'DimStack Cat0'
1 'DimStack Cat1'
2 'DimStack Cat2'.
value labels DimCol
0 'DimCol Cat0'
1 'DimCol Cat1'
2 'DimCol Cat2'.

*set mprint on.
!makespine Cat1 = DimStack Cat2 = DimCol.
*Example Graph - need to just replace Cat1 and Cat2 where appropriate.
dataset activate spinedata.
rename variables (DimStack = Cat1)(DimCol = Cat2).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X2 X1 Y1 Y2 myID Cat1 Cat2 Xmiddle
  MISSING = VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: Y2=col(source(s), name("Y2"))
 DATA: Y1=col(source(s), name("Y1"))
 DATA: X2=col(source(s), name("X2"))
 DATA: X1=col(source(s), name("X1"))
 DATA: Xmiddle=col(source(s), name("Xmiddle"))
 DATA: myID=col(source(s), name("myID"), unit.category())
 DATA: Cat1=col(source(s), name("Cat1"), unit.category())
 DATA: Cat2=col(source(s), name("Cat2"), unit.category())
 TRANS: y_temp = eval(1)
 SCALE: linear(dim(2), min(0), max(1.05))
 GUIDE: axis(dim(1), label("Prop. Cat 2"))
 GUIDE: axis(dim(2), label("Prop. Cat 1 within Cat 2"))
 ELEMENT: polygon(position(link.hull((X1 + X2)*(Y1 + Y2))), color.interior(Cat1), split(Cat2))
 ELEMENT: point(position(Xmiddle*y_temp), label(Cat2), transparency.exterior(transparency."1"))
END GPL.

*This makes the same chart without sorting.
dataset activate cats.
dataset close spinedata.
!makespine Cat1 = DimStack Cat2 = DimCol sort = N.
dataset activate spinedata.
rename variables (DimStack = Cat1)(DimCol = Cat2).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X2 X1 Y1 Y2 myID Cat1 Cat2 Xmiddle
  MISSING = VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: Y2=col(source(s), name("Y2"))
 DATA: Y1=col(source(s), name("Y1"))
 DATA: X2=col(source(s), name("X2"))
 DATA: X1=col(source(s), name("X1"))
 DATA: Xmiddle=col(source(s), name("Xmiddle"))
 DATA: myID=col(source(s), name("myID"), unit.category())
 DATA: Cat1=col(source(s), name("Cat1"), unit.category())
 DATA: Cat2=col(source(s), name("Cat2"), unit.category())
 TRANS: y_temp = eval(1)
 SCALE: linear(dim(2), min(0), max(1.05))
 GUIDE: axis(dim(1), label("Prop. Cat 2"))
 GUIDE: axis(dim(2), label("Prop. Cat 1 within Cat 2"))
 ELEMENT: polygon(position(link.hull((X1 + X2)*(Y1 + Y2))), color.interior(Cat1), split(Cat2))
 ELEMENT: point(position(Xmiddle*y_temp), label(Cat2), transparency.exterior(transparency."1"))
END GPL.
*In the code online I have an example using the Mosaic plot plug-in for R.
********************************************************************.

Citations of Interest

Cox, N. J. (2008). Speaking Stata: Spineplots and their kin. The Stata Journal, 8(1).

Wickham, H., & Hofmann, H. (2011). Product plots. IEEE Transactions on Visualization and Computer Graphics, 17(12).

My posts on CrossValidated Blog

I've made several guest posts on the (now) currently dormant Cross Validated Community Blog. They are;

The notes on making tables are IMO my most useful collection, with the post on small multiples coming in second. Other contributions currently include;

For those not familiar, Cross Validated is a Stack Exchange website where one can ask and answer various questions related to statistics. These sites are a large improvement over listserv emails, and I hope to promote their usefulness and encourage individuals to either contribute to current forums and/or start new ones for particular areas. I also participate on the GIS and Academia sites (as well as answering programming questions for SPSS and R on Stack Overflow).

The blog is just an extension of the site, in that Q/A sessions are not well suited for long discussions. So instead of fitting a square peg in a round hole, I believe the blog is a useful place for discussions and longer commentary useful to the community that aren't quite well placed in Q/A. Unfortunately, community up-take of the blog has been rather minor, so it is currently dormant. Feel free to stop by the Cross Validated chat room, Ten Fold, if you are interested in contributing. I hope the blog doesn't die, but IMO there isn't much point in the current contributors continuing unless there is greater contribution from other individuals.

Some notes on single line charts in SPSS

One thing you can’t do in legacy graph commands in SPSS is superimpose multiple elements on a single graph. One common instance in which I like doing this is to superimpose point observations on a low-frequency line chart. The data example I will use is the reported violent crime rate by the NYPD between 1985 and 2010, taken from the FBI UCR data tool. So below is an example line chart in SPSS.


data list free / year VCR.
begin data
1985    1881.3
1986    1995.2
1987    2036.1
1988    2217.6
1989    2299.9
1990    2383.6
1991    2318.2
1992    2163.7
1993    2089.8
1994    1860.9
1995    1557.8
1996    1344.2
1997    1268.4
1998    1167.4
1999    1062.6
2000    945.2
2001    927.5
2002    789.6
2003    734.1
2004    687.4
2005    673.1
2006    637.9
2007    613.8
2008    580.3
2009    551.8
2010    593.1
end data.
formats year VCR (F4.0).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=year VCR
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: year=col(source(s), name("year"))
 DATA: VCR=col(source(s), name("VCR"))
 GUIDE: axis(dim(1), label("Year"))
 GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
 ELEMENT: line(position(year*VCR), color.interior(color.black))
 ELEMENT: point(position(year*VCR), color.interior(color.black), color.exterior(color.white), size(size."8px"))
END GPL.

This ends up being a pretty simple GPL call (at least relative to other inline GPL statements!). Besides nicely labelling the axes, the only special things to note are;

  • I drew the line element first.
  • I superimposed a point element on top of the line, filled black, with a white outline.

When you make multiple element calls in the GPL specification it acts just like drawing on a piece of paper: the elements listed first are drawn first, and elements listed later are drawn on top of those prior elements. I like the white outline for the superimposed points here because it creates further separation from the line, but it is not obtrusive enough to hinder general assessment of trends in the line.

To back up a bit, one of the reasons I like superimposing the observation points on a line like this is to show explicitly where the observations are on the chart. In these examples it isn't as big a deal, as I don't have missing data and the sampling is regular – but when that isn't the case the line chart can be misleading. Both Kaiser Fung and Naomi Robbins have recent examples that illustrate this point. Although their examples are obviously better off not connecting the lines at all, if I just had one or two years of data missing it might be an ok assumption to interpolate a line through the missing values in this circumstance. Also, in many instances general trends are easier to assess with lines than with bars, and superimposing multiple lines is frequently much better than making dodged bar graphs.

Another reason I like superimposing the points is because in areas of rapid change, the lines appear longer, but the sampling is the same. Superimposing the points reinforces the perception that the line is based on regularly sampled places on the X-axis.

Here I extend this code to further superimpose error intervals on the chart. This is a bit of a travesty for an example of time-series analysis (I just make prediction intervals from a regression on time, time squared and time cubed), but just go with it for the graphical presentation!


*Make a cubic function of time.
compute year_center = year - (1985 + 12.5).
compute year2 = year_center**2.
compute year3 = year_center**3.
*90% prediction interval.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10) CIN(90)
  /NOORIGIN
  /DEPENDENT VCR
  /METHOD=ENTER year_center year2 year3
  /SAVE ICIN .
formats LICI_1 UICI_1 (F4.0).
*Area difference chart.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=year VCR LICI_1 UICI_1
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: year=col(source(s), name("year"))
 DATA: VCR=col(source(s), name("VCR"))
 DATA: LICI_1=col(source(s), name("LICI_1"))
 DATA: UICI_1=col(source(s), name("UICI_1"))
 GUIDE: axis(dim(1), label("Year"))
 GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
 ELEMENT: area.difference(position(region.spread.range(year*(LICI_1 + UICI_1))), color.interior(color.grey), 
                  transparency.interior(transparency."0.5"))
 ELEMENT: line(position(year*VCR), color.interior(color.black))
 ELEMENT: point(position(year*VCR), color.interior(color.black), color.exterior(color.white), size(size."8px"))
END GPL.

This should be another good example where lines are an improvement over bars; I suspect it would be quite visually confusing to draw such an error interval across a series of bars. You could always do dynamite graphs, with error bars protruding from each bar, but that does not allow one to assess the general trend of the error intervals (and dynamite charts shouldn't be encouraged anyway).

My final example uses a polygon element to highlight an area of the chart. If you just want a single line, SPSS has the ability to either post-hoc edit a guideline into the graph, or you can specify the location of a guideline via GUIDE: form.line. What if you want to highlight multiple years though – or just a range of values in general? You can superimpose a polygon element spanning the area of interest. I saw a really nice example of this recently on the Rural Blog detailing per-capita sales before and after a Wal-Mart entered a community.

So here in this similar example I will highlight an area of a few years.


GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=year VCR
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: year=col(source(s), name("year"))
 DATA: VCR=col(source(s), name("VCR"))
 TRANS: begin=eval(1993)
 TRANS: end=eval(1996)
 TRANS: top= eval(3000)
 TRANS: bottom=eval(0)
 GUIDE: axis(dim(1), label("Year"))
 GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
 SCALE: linear(dim(2), min(500), max(2500))
 ELEMENT: polygon(position(link.hull((begin + end)*(bottom + top))), color.interior(color.blue), 
                  transparency.interior(transparency."0.5"))
 ELEMENT: line(position(year*VCR), color.interior(color.black))
 ELEMENT: point(position(year*VCR), color.interior(color.black), color.exterior(color.white), size(size."8px"))
END GPL.

For the polygon element, you first specify the outer coordinates through four TRANS statements, and then in the GPL call you specify that the positions signify the convex hull of the polygon. The inner GPL expression (begin + end)*(bottom + top) evaluates the same as (begin*bottom + begin*top + end*bottom + end*top) because crossing distributes over blending in the graph algebra. The bottom and top just need to be picked to encapsulate some area beyond the visible max and min of the plot (and then the axis is further restricted on the SCALE statement). Because the X axis is continuous, you could even place the boundaries at fractional units, so the points of interest fall inside the shaded area (see the sketch below). It should also be easy to see how to extend this to any arbitrary rectangle within the plot.
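
To make the fractional-units point concrete, here is the same chart with only the begin and end values changed (1992.5 and 1996.5 are just illustrative choices, not from the original post), so the 1993 through 1996 points sit strictly inside the shaded region;

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=year VCR
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: year=col(source(s), name("year"))
 DATA: VCR=col(source(s), name("VCR"))
 TRANS: begin=eval(1992.5)
 TRANS: end=eval(1996.5)
 TRANS: top=eval(3000)
 TRANS: bottom=eval(0)
 GUIDE: axis(dim(1), label("Year"))
 GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
 SCALE: linear(dim(2), min(500), max(2500))
 ELEMENT: polygon(position(link.hull((begin + end)*(bottom + top))), color.interior(color.blue), 
                  transparency.interior(transparency."0.5"))
 ELEMENT: line(position(year*VCR), color.interior(color.black))
 ELEMENT: point(position(year*VCR), color.interior(color.black), color.exterior(color.white), size(size."8px"))
END GPL.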

In both examples with areas highlighted in the charts, I drew the areas first and made them semi-transparent. This allows one to see the gridlines underneath, and the areas don't impede seeing the actual data in the line and points because they are beneath those other vector elements. The transparency is just a stylistic element I personally prefer in many circumstances, even when it isn't needed to prevent elements from being obscured by one another. In these examples I like that the gridlines aren't hidden by the areas, but it is only a minor point.

Why I feel SPSS (or any statistical package) is better than Excel for this particular job

I debated pulling an Andrew Gelman and adding a PS to my prior Junk Charts Challenge post, but it ended up being too verbose, so I just made an entirely new follow-up. To start, the discussion so far has evolved over this series of posts;

  • The original post on remaking a great line chart by Kaiser Fung, with the suggestion that the task (data manipulation and graphing) is easier in Excel.
  • My response on how to make the chart in SPSS.
  • Kaiser's response to my post, in which I doubt I swayed his opinion on using Excel for this task! It appears to me, based on the discussion so far, that the only real quarrel is whether the data manipulation is complicated enough, relative to the ease of pointing and clicking in Excel, to justify moving to something other than Excel. Recreating Kaiser's chart in SPSS does take some advanced knowledge of sorting and using lags to identify the pit and the recoveries (the same logic could be extended to the data manipulations Kaiser says I skim over, as long as you can numerically or externally define what counts as the start of a recession).

All things considered for the internet, the discussion has been pretty cordial so far. Although it is certainly sprinkled in my post, I didn't mean for my post on SPSS to say that the task of grabbing data from online, manipulating it, and creating the graph was in any objective way easier in SPSS than in Excel. I realize pointing and clicking in Excel is easier for most, and only a few people really adept at SPSS (like myself) would consider it easier in SPSS. I write quite a few tutorials on how to do things in SPSS, and that was one of the motivations for the post. I want people using SPSS (or really any graphing software) to make nice graphs – so if I think I can add value this way to the blogosphere, I will! I hope most of my value added is through SPSS tutorials, but I try to discuss general graphing concepts in the posts as well, so even for those not using SPSS they hopefully have some other useful content.

My original post wasn't meant to discuss why I feel SPSS is the better tool for this particular task, although it is certainly a reasonable question to ask (to be frank, I tried to avoid it to prevent flame wars – but it appears I've stepped in it at this point). As one of the comments on Kaiser's follow-up notes (and I agree), some tools are better for some jobs, and we shouldn't prefer one tool out of some sort of dogmatic allegiance. To be clear though, and this was part of my motivation for writing my initial response to the challenge post, I highly disagree that this particular task – grabbing data from the internet, manipulating it, creating a graph, and updating said graph on a monthly basis – is better done in Excel. For a direct example of my non-allegiance to doing everything in SPSS for this job, I wouldn't do the grabbing-the-data-from-the-internet part in SPSS (indeed, it isn't even directly possible unless you use Python code). Assuming it could be fully automated, I would write a custom SPSS job that manipulates the data after a wget command grabs it, and have it all wrapped up in one bat file that runs on a monthly timer.

To go off on a slight tangent, why do I think I'm qualified to make such a distinction? Well, I use both SPSS and Excel on a regular basis. I wouldn't consider myself a whiz at Excel or VBA, but I have made custom Excel macros in the past to perform various jobs (make and format charts/tables, etc.), and I have one task (a custom daily report of the crime incidents reported the previous day) that I do every day at my job in Excel. So, FWIW, I feel reasonably qualified to make decisions about which tasks I should perform in which tools. I'm giving my opinion, the same way Kaiser gave his initial opinion. I doubt my experience is as illustrious as Kaiser's, but you can go to my CV page to see my current and prior work roles as an analyst. If I thought Excel, or Access, or R, or Python, or whatever was a better tool, I would certainly use and suggest it. If you don't have a little trust in my opinion on such matters, well, you shouldn't read what I write!

So, again to be clear, I feel this is a job better done in SPSS (both the data manipulation and creating the graphics), although I admit it is initially harder to write the code to accomplish the task than to point, click and go through chart wizards in Excel. Here I will try to articulate the reasons;

  • Any task I do on a regular basis I want to be as automated as possible. Having to point-click and copy-paste regularly both invites human error and is a waste of time. I don't doubt you could fully (or very nearly) automate the task in Excel (as the comment on my blog post mentions). But this will ultimately involve scripting in VBA, which undercuts any claim that the Excel solution is easier than the SPSS solution.
  • The breadth of data management capabilities, statistical analysis, and graphics is much larger in SPSS than in Excel. Consider the VBA code necessary to replicate my initial VARSTOCASES command in Excel, that is, reshaping wide data to stacked long form. Consider the VBA code necessary to compute summary statistics over different groups without knowing beforehand what the groups are (see the short sketch after this list). These are just a sampling of the data management tools that are routine in statistics packages. In terms of charting, the most obvious function lacking in Excel is that it currently has no facilities to make small-multiple charts (you can see some exceptional hacks from Jon Peltier, but those are certainly more limited in functionality than SPSS). Not mentioned (but most obvious) are the statistical capabilities of a statistical package!
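
For a sense of how compact those two operations are in SPSS syntax, here is a minimal sketch on made-up data (the dataset and the names region, m1 to m3, value, and month are hypothetical placeholders, not from the Junk Charts example);

*Hypothetical wide data: one row per region, one column per month.
data list free / region (A8) m1 m2 m3.
begin data
North 10 12 9
South 20 18 25
end data.
dataset name toywide.
*Reshape wide to stacked long form in one command.
VARSTOCASES
 /MAKE value FROM m1 TO m3
 /INDEX month (value).
*Group summaries without listing the groups beforehand.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
 /BREAK=region
 /mean_value = MEAN(value).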

So certainly this particular job could be done in Excel, as it does not require any functionality unique to a stats package. But why hamstring myself with these limitations from the outset? Frequently, after I build a custom, routine analysis like this, I continually go back and provide more charts, so even if I have a good conceptualization of what I want to do at the outset there is no guarantee I won't want to add such functionality later. In terms of charting, not having flexible small-multiple charts really is a big deal; they can be used all the time.

Admittedly, this job is small enough in scope that if, say, the prior analyst was updating the chart regularly via copy-paste as Kaiser suggests, I would consider just keeping that same format (there certainly is an opportunity cost to re-writing the code in SPSS, and the fact that the update is only monthly means it would take quite some time to recover that investment even if the task were fully automated). I just personally have enough experience in SPSS to know I could script a solution quicker from the outset in SPSS than in Excel (I certainly can't extrapolate that to anyone else though).

Part of both my preference for and experience in SPSS comes from the jobs I personally have to do. For example, I routinely pull a database of 500,000 incidents, do some data cleaning, merge this to a table of 300,000 charges and offenses, and then merge to a second table of geocoded incident locations. Then, using this data, I routinely subset it, create aggregate summaries and tables, estimate various statistics and models, make some rudimentary maps, or export the necessary data to import into GIS software.

For argument's sake, this could mostly be done in SQL (with the exception of some of the more complicated data cleaning) – but certainly no reasonable person should consider doing these multiple table merges and the data cleaning in Excel (the nice interactive facilities of working with the spreadsheet in Excel are greatly diminished with any table that takes more than a few scrolls to see). Statistical packages are really much more than tools to fit models; they are tools for working with and manipulating data. If you have to conduct routine tasks in which you manipulate data (something I assume most analysts have to do), I would highly recommend you consider learning statistical software, the same way I would recommend getting to know SQL.

To be more balanced, here are things (knowing SPSS really well and Excel not as thoroughly) I think Excel excels at compared to SPSS;

  • Ease of making nicely formatted tables
  • Ease of directly interacting and editing components of charts and tables (this includes adding in supplementary vector graphics and labels).
  • Sparklines
  • Interactive Dashboards/Pivot Tables

Routine data management is not one of them, and really only sparklines and interactive dashboards are functionality for which I would prefer to make the end product in Excel over SPSS (and that doesn't mean the whole workflow needs to stay in one piece of software). I clean up ad hoc tables for distribution in Excel all the time, because (as I said above) editing them in Excel is easier than editing them in SPSS. Again, my opinion, FWIW.

The Junk Charts Challenge: Remaking a great line chart in SPSS

I read and very much enjoy Kaiser Fung's blog Junk Charts. In one of the exchanges in the comments to the post Remaking a great chart, Kaiser asserted it was easier to make the original chart in Excel than in any current programming language. I won't deny it is easier to use a GUI dialog than to learn some code, but here I will present how you would go about making the chart in SPSS's grammar of graphics. The logic extends part and parcel to ggplot2.

The short answer is that the data is originally in wide format, and in most statistical packages it is only possible (or at least much easier) to make the chart when the data is in long format. This ends up being not a FAQ but a frequent answer to different questions, so I hope going over such a task will have wider utility for a lot of charting tasks.

So here is the original chart (originally from the Calculated Risk blog)

And here is Kaiser Fung’s updated version;

Within the article Kaiser states;

One thing you’ll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I’m not saying you can’t create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it’s done. But I find it surprising how much work it would be to use standard tools like R to do this.

Of course, because anyone savvy with a statistical package would call bs (because it is), Kaiser gets some comments from more experienced R users saying so. Then Kaiser retorts in the comments with a question about how to go about making the charts in R;

Hadley and Dean: I’m sure you’re better with R than most of us so I’d love to hear more. I have two separate issues with this task:

  1. assuming I know exactly the chart to build, and have all the right data elements, it is still much easier to use Excel than any coding language. This is true even if I have to update the chart month after month like CR blog has to. I see this as a challenge to those creating graphing software. (PS. Here, I’m thinking about the original CR version – I don’t think that one can easily make small multiples in Excel.)
  2. I don’t see a straightforward way to proceed in R (or other statistical languages) from grabbing the employment level data from the BLS website, and having the data formatted precisely for the chart I made. Perhaps one of you can give us some pseudo-code to walk through how you might do it. I think it’s easier to think about it than to actually do it.

So here I will show how one would go about making the charts in a statistical package, in this case SPSS. I don't use the exact same data to make the same chart, but there is very similar data at the Fed Bank of Minneapolis website. Here I utilize the table on the cumulative decline of non-farm employment (seasonally adjusted) in the months after the NBER-defined peak. I re-format the data so it can actually be read into a statistical package, and here is the xls data sheet. Also at that link, the zip file contains all the SPSS code needed to reproduce the charts in this blog post.

So first up, the data from the Fed Bank of Minneapolis website looks approximately like this (in csv format);

MAP,Y1948,Y1953,Y1957,Y1960,Y1969,Y1973,Y1980,Y1981,Y1990,Y2001,Y2007
0,0,0,0,0,0,0,0,0,0,0,0
1,-0.4,-0.1,-0.4,-0.6,-0.1,0.2,0.1,0.0,-0.2,-0.2,0.0
2,-1.1,-0.3,-0.7,-0.8,0.1,0.3,0.2,-0.1,-0.3,-0.2,-0.1
3,-1.5,-0.6,-1.1,-0.9,0.3,0.4,0.1,-0.2,-0.4,-0.3,-0.1
4,-2.1,-1.2,-1.4,-1.0,0.2,0.5,-0.4,-0.5,-0.5,-0.4,-0.3

This isn't my forte, so when Kaiser says to grab the employment level data from the BLS website I'm unsure what exact data or table he is talking about. Regardless, if the table you grab the data from is in this wide format, it will be easier to make the charts we want once the data is reshaped to long format. In long format, instead of every line being a different column, all the lines are in one column, like so;

MAP, YEAR, cdecline
0, 1948, 0
1, 1948, -0.4
.
72, 1948, 8.2
0, 2007, 0
1, 2007, 0
.

So in SPSS, the steps would be like this to reshape the data (after reading in the data from my prepped xls file);

GET DATA /TYPE=XLS
 /FILE='data\historical_recessions_recoveries_data_03_08_2013.xls'
   /SHEET=name 'NonFarmEmploy'
   /CELLRANGE=full
   /READNAMES=on
   /ASSUMEDSTRWIDTH=32767.
DATASET NAME NonFarmEmploy.

*Reshape wide to long.
VARSTOCASES
/MAKE cdecline from Y1948 to Y2007
/INDEX year (cdecline).
compute year = REPLACE(year,"Y","").

This produces the data so that instead of having separate years in different variables, you have the cumulative decline in one column in the dataset, with another categorical variable identifying the year. Ok, so now we are ready to make a chart that replicates the original from the Calculated Risk blog. Here is the necessary code in SPSS to make a well-formatted chart. Note the compute statement first makes a variable flagging whether the year is 2007, which I then map to the aesthetics of red and larger size, so it comes to the foreground of the chart;

compute flag_2007 = (year = "2007").
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP cdecline flag_2007 year
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: cdecline=col(source(s), name("cdecline"))
 DATA: flag_2007=col(source(s), name("flag_2007"), unit.category())
 DATA: year=col(source(s), name("year"), unit.category())
 SCALE: cat(aesthetic(aesthetic.color), map(("0", color.grey), ("1", color.red)))
 SCALE: cat(aesthetic(aesthetic.size), map(("0",size."1px"), ("1",size."3.5px")))
 SCALE: linear(dim(1), min(0), max(72))
 SCALE: linear(dim(2), min(-8), max(18))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 GUIDE: form.line(position(*,0), size(size."1px"), shape(shape.dash), color(color.black))
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 GUIDE: legend(aesthetic(aesthetic.size), null())
 ELEMENT: line(position(MAP*cdecline), color(flag_2007), size(flag_2007), split(year))
END GPL.

Which produces this chart (ok, I cheated a little: I added the labels in by hand post hoc in the SPSS chart editor, as I did not like the automatic label placement, and it is easier to add them by hand than to fix the automated labels). Also note this will appear slightly different from the default SPSS charts because I use my own personal chart template.

That is one hell of a chart command call though! You can actually produce most of the lines for this call through SPSS's GUI dialog, and it just takes some more knowledge of SPSS's graphics language to adjust the aesthetics of the chart. It would take a book to go through exactly how GPL works and the structure of the grammar, but here is an attempt at a brief run-down.

So typically you would make separate lines by specifying that every year gets its own color. It is nearly impossible to distinguish between all of the lines that way though (as Kaiser originally states). A simple solution is to highlight only the line we are interested in, 2007, and make the rest of the lines the same color. To do this and still have the lines rendered separately in SPSS's GPL code, one needs to specify the split modifier within the ELEMENT statement (the equivalent in ggplot2 is the group argument within aes). The things I manually edited relative to the original code generated through the GUI are;

  • Guide line at the zero value, and then making the guideline 1 point wide, black, and with a dashed pattern (GUIDE: form.line)
  • Color and size the 2007 line differently than the rest of the lines (SCALE: cat(aesthetic(aesthetic.color), map(("0", color.grey), ("1", color.red))))
  • Set the upper and lower boundary of the x and y axis (SCALE: linear(dim(2), min(-8), max(18)))
  • set the labels for the x and y axis, and set how often tick marks are generated (GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2)))
  • set the chart so the legend for the mapped aesthetics are not generated, because I manually label them anyway (GUIDE: legend(aesthetic(aesthetic.size), null()))

Technically, in both SPSS and ggplot2 you could produce the chart from the original wide format, but this ends up being more code in the chart call (and it grows with the number of groups) than simply reshaping the data so that the values making the lines are in one column. A sketch of what that would look like is below.
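
For illustration, here is a minimal sketch of the wide-format approach (a hypothetical variation, not code from the original post). It would be run against the wide NonFarmEmploy dataset before the VARSTOCASES step, and it shows only three of the eleven years – every additional year needs its own DATA and ELEMENT line;

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP Y1948 Y1953 Y2007
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: Y1948=col(source(s), name("Y1948"))
 DATA: Y1953=col(source(s), name("Y1953"))
 DATA: Y2007=col(source(s), name("Y2007"))
 GUIDE: axis(dim(1), label("Months After Peak"))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"))
 ELEMENT: line(position(MAP*Y1948), color.interior(color.grey))
 ELEMENT: line(position(MAP*Y1953), color.interior(color.grey))
 ELEMENT: line(position(MAP*Y2007), color.interior(color.red), size(size."3.5px"))
END GPL.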

The long-format chart above, IMO, makes the point we want to make easily and succinctly. The 2007 recession has had a much harsher drop-off in employment, and has lasted much longer, than any recession since 1948. All of the further small multiples are superfluous unless you really want to drill down into the differences between prior years, which are small in magnitude compared to the current recession. Using thin lines and semi-transparency is the best way to plot many lines (and I wish people running regressions on panel data sets did it more often!)

So although that one graph call is complicated, it takes relatively few lines of code to read in the data and make it. In ggplot2 I'm pretty sure it would be fewer lines (Hadley's version of the grammar is much less verbose than SPSS's). So, in code-golf terms of complexity, we are doing alright. The power in programming, though, is that it is trivial to reuse the code. To make a paneled version similar to Kaiser's remake we simply need to make the panel groupings, then copy-paste and slightly update the prior code to make a new chart;

compute #yearn = NUMBER(year,F4.0).
if RANGE(#yearn,1940,1959) = 1 decade = 1.
if RANGE(#yearn,1960,1979) = 1 decade = 2.
if RANGE(#yearn,1980,1999) = 1 decade = 3.
if RANGE(#yearn,2000,2019) = 1 decade = 4.
value labels decade
1 '1940s-50s'
2 '1960s-70s'
3 '1980s-90s'
4 '2000s'.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP cdecline year decade flag_2007
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: cdecline=col(source(s), name("cdecline"))
 DATA: year=col(source(s), name("year"), unit.category())
 DATA: flag_2007=col(source(s), name("flag_2007"), unit.category())
 DATA: decade=col(source(s), name("decade"), unit.category())
 SCALE: cat(aesthetic(aesthetic.color), map(("0", color.black), ("1", color.red)))
 SCALE: cat(aesthetic(aesthetic.size), map(("0",size."1px"), ("1",size."3.5px")))
 SCALE: linear(dim(1), min(0), max(72))
 SCALE: linear(dim(2), min(-8), max(18))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 GUIDE: axis(dim(4), opposite())
 GUIDE: form.line(position(*,0), size(size."0.5px"), shape(shape.dash), color(color.lightgrey))
 GUIDE: legend(aesthetic(aesthetic.color), null())
 GUIDE: legend(aesthetic(aesthetic.size), null())
 ELEMENT: line(position(MAP*cdecline*1*decade), color(flag_2007), size(flag_2007), split(year))
END GPL.

Comparing the new paneled chart syntax to the original, it should be easy to see it only took two slight changes: 1) I needed to add in the new decade variable and define it in the DATA mapping, and 2) I needed to add it to the ELEMENT call to produce paneling by row. Again I cheated a little: I post hoc edited the gridlines out of the image and changed the size of the Y axis labels. If I really wanted to automate these things in SPSS, I would need to rely on a custom template. In ggplot2 in R this is not necessary, as everything is exposed in the programming language. This is quite short work. Harder is adding in labels; I don't bother here, and I would assume that to do it nicely (if really needed) I would need to do it manually. I don't bother because it isn't clear to me why I should care which prior years are which.

On aesthetics, I would note Kaiser's original panelled chart lacks distinction between the panels, which makes it easy to confuse Y axis values. I much prefer the default behavior of SPSS here. Also, the default here does not look as nice as the original in terms of the X to Y axis ratio. This is because the panels shrink the chart's Y axis (but keep the X axis the same). I suspect my first chart looks nicer because it is closer to the Cleveland ideal of banking the line slopes to an average of 45 degrees.

What about the data manipulation Kaiser suggests is difficult to conduct in a statistical programming language? Well, that is more difficult, but certainly not impossible (and certainly not faster in Excel for anyone who knows how to do it!). Here is how I would go about identifying the start, the trough, and the recovery in SPSS.

*Small multiple chart in piecewise form, figure out start, min and then recovery.
compute flag = 0.
*Start.
if MAP = 0 flag = 1.
*Min.
sort cases by year cdecline.
do if year <> lag(year) or $casenum = 1.
    compute flag = 2.
    compute decline_MAP = MAP.
else if year = lag(year). 
    compute decline_MAP = lag(decline_MAP).
end if.
*Recovery.
*I need to know if it is after the min to estimate this, some have a recovery before the min otherwise.
sort cases by year MAP.
if lag(cdecline) < 0 and cdecline >= 0 and MAP > decline_MAP flag = 3.
if year = "2007" and MAP = 62 flag = 3.
exe.
*Now only select these cases.
dataset copy reduced.
dataset activate reduced.
select if flag > 0.

So another 16 lines (that aren't comments) – what is this world of complex statistical programming coming to! If you want a run-down of how I am using lagged values to identify the places, see my recent post on sequential case processing.

Again, we can just copy and paste the chart syntax to produce the same chart with the reduced data. This time it is the exact same code as prior, so no updating needed.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP cdecline year decade flag_2007
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: cdecline=col(source(s), name("cdecline"))
 DATA: year=col(source(s), name("year"), unit.category())
 DATA: flag_2007=col(source(s), name("flag_2007"), unit.category())
 DATA: decade=col(source(s), name("decade"), unit.category())
 SCALE: cat(aesthetic(aesthetic.color), map(("0", color.black), ("1", color.red)))
 SCALE: cat(aesthetic(aesthetic.size), map(("0",size."1px"), ("1",size."3.5px")))
 SCALE: linear(dim(1), min(0), max(72))
 SCALE: linear(dim(2), min(-8), max(1))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 GUIDE: axis(dim(4), opposite())
 GUIDE: form.line(position(*,0), size(size."0.5px"), shape(shape.dash), color(color.lightgrey))
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 GUIDE: legend(aesthetic(aesthetic.size), null())
 ELEMENT: line(position(MAP*cdecline*1*decade), color(flag_2007), size(flag_2007), split(year))
END GPL.

Again, I lied a bit earlier: you really only needed 14 lines of code to produce the above chart, as I actually spent a few of them saving to a new dataset. I wanted to see if the reduced summary in this dataset was an accurate representation. You can see it is, except for 1973 and 1980, which had slight positive recoveries before bottoming out, so one bend in the curve doesn't really cut it in those instances. Again, the chart only takes some slight editing of the GPL to produce. Here I produce a chart where each year has its own panel, and the panels are wrapped (instead of placed in new rows). This is useful when you have many panels.

compute reduced = 1.
dataset activate NonFarmEmploy.
compute reduced = 0.
add files file = *
/file = reduced.
dataset close reduced.
value labels reduced
0 'Full Series'
1 'Kaisers Reduced Series'.

*for some reason, not letting me format labels for small multiples.
value labels year
'1948' "48"
'1953' "53"
'1957' "57"
'1960' "60"
'1969' "69"
'1973' "73"
'1980' "80"
'1981' "81"
'1990' "90"
'2001' "01"
'2007' "07".

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=MAP cdecline year flag_2007 reduced
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: MAP=col(source(s), name("MAP"))
 DATA: cdecline=col(source(s), name("cdecline"))
 DATA: year=col(source(s), name("year"), unit.category())
 DATA: flag_2007=col(source(s), name("flag_2007"), unit.category())
 DATA: reduced=col(source(s), name("reduced"), unit.category())
 COORD: rect(dim(1,2), wrap())
 SCALE: cat(aesthetic(aesthetic.color), map(("0", color.black), ("1", color.red)))
 SCALE: linear(dim(1), min(0), max(72))
 SCALE: linear(dim(2), min(-8), max(18))
 GUIDE: axis(dim(1), label("Months After Peak"), delta(6))
 GUIDE: axis(dim(2), label("Cum. Decline from NBER Peak"), delta(2))
 GUIDE: axis(dim(3), opposite())
 GUIDE: form.line(position(*,0), size(size."0.5px"), shape(shape.dash), color(color.lightgrey))
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 GUIDE: legend(aesthetic(aesthetic.size), null())
 ELEMENT: line(position(MAP*cdecline*year), color(reduced))
END GPL.

SPSS was misbehaving and labelling my years with a comma. To prevent that I made value labels with just the trailing two digits of the year. Again I post hoc edited the size of the Y and X axis labels and manually removed the gridlines.

As opposed to going into a diatribe about the utility of learning a statistical programming language, I will just say that, if you are an analyst who works with data on a regular basis, you are doing yourself a disservice by sticking only to Excel. Not only is the tool limited in the types of graphics and analysis one can conduct, it is also very difficult to make tasks routine and reproducible.

Part of my disappointment is that I highly suspect Kaiser has such programming experience; he just hasn't taken the time to learn a statistical program thoroughly enough. I wouldn't care, except that Kaiser is in a position of promoting best practices, and I would consider this one of them. I don't deny that learning such programming languages is not easy, but as an analyst who works with data every day, I can tell you it is certainly worth the effort to learn a statistical programming language well.

Some random SPSS graph tips: shading areas under curves and using dodging in binned dot plots

This is just a quick post on some random graphing examples you can do in SPSS through inline GPL statements, but that are not possible through the GUI dialog. These also take knowing a little bit about the grammar of graphics and the nuts and bolts of SPSS's implementation. First up, shading under a curve.

Shading under a curve

I assume the motivation for doing this is obvious, but it takes a little advanced GPL to figure out how to accomplish it. I swear someone asked how to do this the other day on Nabble, but I could not find any such question. Below is an example.

*****************************************.
input program.
loop #i = 1 to 2000.
compute X = (#i - 1000)/250.
compute PDF = PDF.NORMAL(X,0,1).
compute CDF = CDF.NORMAL(X,0,1).
end case.
end loop.
end file.
end input program.
dataset name sim.
exe.

formats PDF X (F2.1).

*area under entire curve.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PDF MISSING=LISTWISE
  REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: X=col(source(s), name("X"))
 DATA: PDF=col(source(s), name("PDF"))
 GUIDE: axis(dim(1), label("X"))
 GUIDE: axis(dim(2), label("Prob. Dens."))
 ELEMENT: area(position(X*PDF), missing.wings())
END GPL.


*Mark off different areas.
compute tails = 0.
if CDF <= .025 tails = 1.
if CDF >= .975 tails = 2.
exe.

*Area with particular locations highlighted.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PDF tails 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: X=col(source(s), name("X"))
 DATA: PDF=col(source(s), name("PDF"))
 DATA: tails=col(source(s), name("tails"), unit.category())
 SCALE: cat(aesthetic(aesthetic.color.interior), map(("0",color.white),("1",color.grey),("2",color.grey)))
 SCALE: cat(aesthetic(aesthetic.transparency.interior), map(("0",transparency."1"),("1",transparency."0"),("2",transparency."0")))
 GUIDE: axis(dim(1), label("X"))
 GUIDE: axis(dim(2), label("Prob. Dens."))
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 GUIDE: legend(aesthetic(aesthetic.transparency.interior), null())
 ELEMENT: area(position(X*PDF), color.interior(tails), transparency.interior(tails))
END GPL.
*****************************************.

The area under the entire curve is pretty simple code, and can be accomplished through the GUI. The shading under different sections, though, requires a bit more thought. If you want both the upper and lower tails of the PDF colored, you need to specify separate categories for them; otherwise they will connect at the bottom of the graph. Then you need to map the categories to specific colors, and if you want to be able to see the gridlines behind the central area you need to make the center area transparent. Note I also omit the legend, as I assume it will be obvious what the graph represents given other context or textual summaries.

Binning scale axis to produce dodging

The second example is based on the fact that for SPSS to utilize the dodge collision modifier, one needs a categorical axis. What if you want the axis to really be continuous though? You can treat the data as categorical but keep the axis on a continuous scale by specifying a binned scale – just make the bins small enough to suit your actual data values. This is easy to show with a categorical dot plot. If you can, IMO it is better to use dodging than jittering, and below is a perfect example. If you run the first GGRAPH statement, you will see the points aren't dodged, although the graph is generated just fine and dandy with no error messages. The second graph bins the X variable (which is on the second dimension) with intervals of width 1. This ends up being exactly the same as the continuous axis, because the values are all positive integers anyway.

*****************************************.
set seed = 10.
input program.
loop #i = 1 to 1001.
compute X = TRUNC(RV.UNIFORM(0,101)).
compute cat = TRUNC(RV.UNIFORM(1,4)).
end case.
end loop.
end file.
end input program.
dataset name sim.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X cat
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: X=col(source(s), name("X"))
 DATA: cat=col(source(s), name("cat"), unit.category())
 COORD: rect(dim(1,2))
 GUIDE: axis(dim(1), label("cat"))
 ELEMENT: point.dodge.symmetric(position(cat*X))
END GPL.

*Now lets try to bin X so the points actually dodge!.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X cat
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: X=col(source(s), name("X"))
 DATA: cat=col(source(s), name("cat"), unit.category())
 COORD: rect(dim(1,2))
 GUIDE: axis(dim(1), label("cat"))
 ELEMENT: point.dodge.symmetric(position(bin.rect(cat*X, dim(2), binWidth(1))))
END GPL.
****************************************.

Both examples shown here take only slight alterations to code that can be generated through the GUI, but they take a bit more understanding of the grammar to know how to accomplish (or even to know they are possible). You unfortunately can't implement Wilkinson's (1999) true dot plot technique like this (he doesn't suggest binning, but rather choosing where the dots are placed via kernel density estimation). But this should be sufficient for most circumstances.

Interval graph for viz. temporal overlap in crime events

I've made a visualization intended to be an exploratory tool for identifying overlapping criminal events over a short period. Code to reproduce the macro is here, and it includes an example with made-up data. As opposed to aoristic analysis, which gives global aggregated summaries of when crime events occurred while potentially correcting for the unspecified time period, this graphical procedure allows you to identify whether individual events overlap. It also allows one to perceive global information, in particular whether the uncertainty of events falls in the morning, afternoon, or night.

A call to the macro ends up looking like this (other vars is optional and can include multiple variables – I assume token names are self-explanatory);

!interval_data date_begin = datet_begin time_begin = XTIMEBEGIN date_end = datet_end time_end = XTIMEEND 
label_id = myid other_vars = crime rand.

This just produces a new dataset named interval_data, which can then be plotted. And here is the example graph that comes with the macro and its fake data (after I edited the chart slightly).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" dataset = "interval_data" VARIABLES= LABELID DAY TIMEBEGIN TIMEEND MID WITHINCAT DAYWEEK
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: LABELID=col(source(s), name("LABELID"), unit.category())
 DATA: DAY=col(source(s), name("DAY"), unit.category())
 DATA: TIMEBEGIN=col(source(s), name("TIMEBEGIN"))
 DATA: TIMEEND=col(source(s), name("TIMEEND"))
 DATA: MID=col(source(s),name("MID"))
 DATA: WITHINCAT=col(source(s),name("WITHINCAT"), unit.category())
 DATA: DAYWEEK=col(source(s),name("DAYWEEK"), unit.category())
 COORD: rect(dim(1,2), cluster(3,0))
 SCALE: cat(dim(3), values("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23",
                                         "24","25","26","27","28","29","30","31"))
 ELEMENT: interval(position(region.spread.range(WITHINCAT*(TIMEBEGIN + TIMEEND)*DAY)), color.interior(DAYWEEK),
                             transparency.interior(transparency."0.1"), transparency.exterior(transparency."1"))
 ELEMENT: point(position(WITHINCAT*MID*DAY), color.interior(DAYWEEK), label(LABELID))
END GPL.

The chart can be interpreted as follows: the days of the week are colored, and each interval represents one crime event, except when a crime event occurs overnight, in which case the bar is split over two days (and the event is labeled on each day). I wanted labels in the chart to easily reference specific events, so I assigned a point to the midpoint of each interval to plot the labels (also to give some visual girth to events that occurred over a short interval – otherwise they would be invisible in the chart). To displace the bars horizontally within the same day, the chart essentially uses the same type of procedure that occurs in clustered bar charts.

GPL code can be inserted directly within macros, but it is quite a pain. It is better to use Python to parameterize GGRAPH, but I'm too lazy and don't have Python installed on my machine at work (an ancient version of SPSS, V15, is to blame).

Here is another example with more of my data in the wild. This is for thefts from motor vehicles in Troy from the beginning of the year until 2/22. We had a bit of a rash over that time period, but they have since died down after the arrest of one particularly prolific offender. This is evident in the chart.

We can also break the chart down by other categories. This is what the token other_vars is for; it carries forward these other variables for use in faceting. For example, Troy has 4 police zones, and here is the graph broken down by each of them. Obviously, crime sprees within short time frames are more likely perpetrated in close proximity, and events committed by the same person are likely to re-occur within the same geographic proximity. The individual noted before was linked to events in Zone 4. I turn the labels off (it is pretty easy to toggle them in SPSS), and then one can focus on either individual events close in time or overlapping intervals pretty easily.

*Panelling by BEAT_DIST.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" dataset = "interval_data" VARIABLES= LABELID DAY TIMEBEGIN TIMEEND MID WITHINCAT DAYWEEK BEAT_DIST
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: LABELID=col(source(s), name("LABELID"), unit.category())
 DATA: DAY=col(source(s), name("DAY"), unit.category())
 DATA: TIMEBEGIN=col(source(s), name("TIMEBEGIN"))
 DATA: TIMEEND=col(source(s), name("TIMEEND"))
 DATA: MID=col(source(s),name("MID"))
 DATA: WITHINCAT=col(source(s),name("WITHINCAT"), unit.category())
 DATA: DAYWEEK=col(source(s),name("DAYWEEK"), unit.category())
 DATA: BEAT_DIST=col(source(s),name("BEAT_DIST"), unit.category())
 COORD: rect(dim(1,2), cluster(3,0))
 SCALE: cat(dim(3), values("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23",
                           "24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45",
                           "46"))
 ELEMENT: interval(position(region.spread.range(WITHINCAT*(TIMEBEGIN + TIMEEND)*DAY*1*BEAT_DIST)), color.interior(DAYWEEK),
                             transparency.interior(transparency."0.1"), transparency.exterior(transparency."1"))
 ELEMENT: point(position(WITHINCAT*MID*DAY*1*BEAT_DIST), color.interior(DAYWEEK), label(LABELID))
END GPL.

Note that to make the X axis include all of the days, I needed to list all of the numerical categories (between 1 and 46) in the values parameter of the SCALE statement.

The chart in its current form could potentially be improved in a few ways, but I've had trouble accomplishing them so far. One is that instead of utilizing clustering to displace the intervals, one could use dodging directly; I have yet to figure out how to specify the appropriate GPL code when using dodging instead of clustering. Another is to utilize the dates directly, instead of an arbitrary categorical counter since the beginning of the series. To use dodging or clustering the axis needs to be categorical, so you could make the axis the date but bin the dates and specify the width of the bin categories to be 1 day (this would also avoid the annoying values statement to specify all the days). Again though, I was not able to figure out the correct GPL code to accomplish this.

I’d like to investigate ways to make this interactive and link with maps as well. I suspect it is possible in D3.js, and if I get around to figuring out how to make such a map/graphic duo I will for sure post it on the blog. In the meantime any thoughts or comments are always appreciated.

Aoristic analysis with SPSS

I've written a macro to conduct aoristic analysis in SPSS. Here I will briefly describe what it is, provide some references, and demonstrate some of its utility on example data from the Arlington PD.

In short, crime event data are frequently recorded as occurring within some indefinite time frame. For example, you may park your car and go to work at 08:00, and come back out at 16:30 the same day to find your car window broken and your GPS stolen. Unless there happen to be other witnesses to the crime, you don't know when the criminal event occurred beyond it being between those two times. This is problematic for crime analysis because you want to be able to look at the distribution of when events occurred, so as to suggest why the events are occurring and how to potentially address them. Allocating patrols geographically and temporally to areas of high crime incidence has been regular practice for a long time (Wilson, 1963)! Aoristic analysis is simply a means of taking that uncertainty about when the event occurred into account when we examine the overall incidence of crimes across a set of times.

For a very brief illustrative example, let's say we want to know the number of crimes occurring within the hour bins beginning at 08:00, 09:00 and 10:00. If we had a criminal event that potentially occurred between 08:00 and 10:00, a total time span of 120 minutes (2 hours), instead of counting that event as occurring at 08:00 (the begin time), 10:00 (the end time) or 09:00 (the middle time), we spread the event out over the time frame and only partially count it within any particular interval. So here it would receive a weight of 0.50 in both the 08:00 and 09:00 bins (60/120 = 0.50) and a weight of 0 in the 10:00 bin (note the weights sum back to 1). This just ends up being a way to estimate the incidence of some event within a given time bin knowing that the event did not necessarily occur in that bin, so it only partially counts towards the total, where "partially" is defined by how long the interval is and how much of that interval overlaps with the bin.
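
To make the arithmetic concrete, here is a minimal self-contained sketch of that weighting (this is not the macro code; the variable names begin_min and end_min and the hour loop are just for illustration).

*Minimal sketch (not the macro): weight one event, 08:00 to 10:00, across 24 one hour bins.
*Weight in a bin = minutes of overlap with the bin / total length of the interval.
DATA LIST FREE / begin_min end_min.
BEGIN DATA
480 600
END DATA.
COMPUTE #dur = end_min - begin_min.
VECTOR W(24).
LOOP #h = 1 TO 24.
  COMPUTE W(#h) = MAX(0, MIN(end_min, #h*60) - MAX(begin_min, (#h-1)*60)) / #dur.
END LOOP.
EXECUTE.
*W9 (the 08:00 bin) and W10 (the 09:00 bin) each end up 0.50, every other bin is 0.
LIST begin_min end_min W9 W10 W11.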

Here I illustrate the macro with some examples from the Arlington PD data downloaded on 1/11/2013. This is the only dataset I've found publicly available online that has both start and end dates for events (I first looked at NIBRS, and was slightly surprised that they did not have this information). I have posted the macro code, with examples therein using fake data and the same Arlington PD data, and compare the results to this online calculator.

Note, my results will be slightly different than those of most other programs (including the online app I pointed to) because of one arbitrary (but what I feel is reasonable) coding decision. When an event's interval is longer than the period being evaluated in a particular estimate (either a day or a week for my functions), the event is simply returned with equal weight across the whole period. Others don't do this as far as I'm aware. So say an event takes place between 08:00 on 1/2/2013 and 10:00 on 1/3/2013. For my functions that only evaluate times over the day, I would return equal probability within every time slot, whereas some calculators would say there is a higher probability of occurring between 08:00 and 10:00 (because of the wrap-around). I believe that practice is a bit of a stretch; any uncertainty spanning over a full day is essentially saying the interval is useless for determining the time of day at all (although my week functions would be equivalent in that example). In those cases the begin and end times say more about when people check their cars, wake up in the morning, get home from work, come back from vacation, etc. than they do about when the actual crime occurred.
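
Just to spell out that coding decision, here is a tiny sketch of the rule for the day functions (again hypothetical names, not the macro code): if the interval is a full day or longer, every hourly bin simply receives an equal share.

*Sketch of the over-a-day rule: spread the event evenly over all 24 hourly bins.
DATA LIST FREE / dur_min.
BEGIN DATA
1560
END DATA.
VECTOR W(24).
DO IF (dur_min >= 1440).
  LOOP #h = 1 TO 24.
    COMPUTE W(#h) = 1/24.
  END LOOP.
END IF.
EXECUTE.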

Some examples

If you want to follow along right within SPSS, I suggest going to the google code site where I've posted the code and data, but otherwise you can just take my word for it and see how the macro works in practice. I provide several separate functions to estimate the frequency of crimes occurring in 1 hour bins over the day, 15 minute bins over the day, days over the week, 1 hour bins over the week, or 15 minute bins over the week.

Here is an example call of the macro and the output from the 1 hour bins over the day with all crimes for the Arlington data.

!aoristic_day1hour begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2.

Date1 and Date2 are the begin and end dates respectively, and Time1 and Time2 are the begin and end times respectively. Below is (close to) the automated graph the macro produces, which is just a line chart superimposing both the aoristic estimate and what the estimate would be if using the begin, end or middle time. The only differences are that I post-hoc edited the aoristic estimate line to be thicker and in front (so it is more prominent), plus my personal chart template. Parameterizing GGRAPH charts to work in macros is quite annoying, and Python is a preferable solution (I'm personally happy with just a helper function to return the data in a nice format so I can write the appropriate GGRAPH code myself; there is more power in knowing the grammar than being complacent with the default chart).

[Chart: aoristic estimate vs. begin, middle and end times, 1 hour bins over the day]

So you can see here that overall, the aoristic estimate does not make much of a difference. Here is the same info for the 15 minute bins across the day.

!aoristic_day15min begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2.
[Chart: aoristic estimate vs. begin, middle and end times, 15 minute bins over the day]

You can see here the aoristic estimate smooths out the data quite a bit more (which is nice above and beyond just worrying about whether one approach is correct or not). With the smaller time bins you can also see a tendency to over-report incidents at natural hour and half-hour marks. Midnight, noon and 08:00 are conspicuously popular times to report as either the beginning or end of an incident. You can spot a few others that differ between begin and end times; for instance 07:00 appears to be a popular end time but not so popular a begin time. The reverse is true for events in the middle afternoon, late evening and early night (e.g. the big spikes in the green begin time line for hours between 17:00 and midnight). Also note that it is near universal that crime dips to its lowest around 4~5 am, and you can see that using either the aoristic estimate or the middle point of the event brings the number of events up during this period (as expected).

The set of functions also has the capability to specify an arbitrary category to split the results by; here is an example splitting the day of week aoristic estimate by the beat variable provided with the Arlington data. Again this isn't the chart straight from the macro, but a subsequent GGRAPH command to produce a nicer chart (the original is ok if you make it much bigger, but with so many categories facet wrapping is appropriate to save space).

!aoristic_week begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2 split = Beat.
[Chart: day of week aoristic estimates split by beat, facet wrapped]

The main thing that draws attention in this graph is the difference in the level of calls and the differing trends between beats (there were no obvious differences between the aoristic estimates and the naive estimates, which is unsurprising since most incidents don't have uncertainty spanning more than a day). I know nothing about Arlington, and I don't know where these beats are, so I can't say anything about why these differences might occur. In SPSS days of the week start at Sunday (so Sunday = 1 and Saturday = 7). It isn't that strange to expect slightly more crime on Fridays and Saturdays (people out and about doing things that make them more vulnerable), but the flat profile most of the beats show is not strange either. It is interesting to see Beat 260 have an atypical pattern of obviously more crimes during the week, and if I had to hazard a guess I would assume there is a middle or high school in Beat 260.

Although you could argue aoristic analysis is called for purely on theoretical grounds, in all these examples it doesn't make much of a difference whether you use it or simpler methods. Where it is likely to make the most difference are events with the longest unknown time intervals. Property crimes tend to be committed when the victim is not around, so here I compare the aoristic estimate for 15 minute intervals over the day for burglaries, which shows an example where the aoristic estimate makes an actual difference!

[Chart: aoristic estimate vs. begin, middle and end times for burglaries, 15 minute bins over the day]

One can see the aoristic estimate makes a larger difference for property crimes across the day, and it is largely flat compared to the begin and end times. The begin and end times are likely biased towards when people discovered the victimization or when they last left their home vulnerable. There is a slight trend for more burglaries to occur during the daytime, and somewhat higher levels during the night (with lulls around 08:00 and 18:00). These are near the exact opposite of the conclusions you would draw from the begin and end times as to when most burglaries occurred! Middle times result in some weird differences as well, with a high spike in the 01:00 to 04:00 range.

Some closing notes

This project kicked my butt a little, and took much longer than I expected. Certainly the code could use improvement and re-factoring, but I'm glad it is done (and seemingly working). You will see there is a lot of redundancy between the functions, some temporary variables are computed multiple times, and the week long functions take a while to compute.

In the SPSS macro what I do in a nutshell is make a variable for every time bin, calculate the weight each case has for that bin, reshape the dataset from wide to long, and finally aggregate the total weight within each time bin. Note this results in many more cases than the original data. For example, the Arlington data shown above has slightly over 49,000 incidents. For the 15 minute intervals per day (96 bins), this results in over 4.7 million cases (49,000*96). The 1 hour bins across the week result in n times 168 cases, and the 15 minute bins across the week in n times 672! Consequently those latter two take an appreciable amount of time to compute for larger datasets (if you don't run out of local memory entirely, which I'm guessing could easily happen on older systems once you have upwards of 60,000 cases).
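
To make that pipeline concrete, here is a minimal sketch using two fake events and 1 hour bins over the day (this is boiled down for illustration and is not the macro itself; all the variable names are made up).

*Two fake events, 08:00-10:00 and 22:00-23:00, with times stored as minutes of the day.
DATA LIST FREE / begin_min end_min.
BEGIN DATA
480 600
1320 1380
END DATA.
*One weight variable per hourly bin: overlap with the bin / length of the interval.
VECTOR W(24).
LOOP #h = 1 TO 24.
  COMPUTE W(#h) = MAX(0, MIN(end_min, #h*60) - MAX(begin_min, (#h-1)*60)) / (end_min - begin_min).
END LOOP.
*Reshape from wide to long (one case per event per bin), then sum the weights within bins.
VARSTOCASES /MAKE Weight FROM W1 TO W24 /INDEX = HourBin.
AGGREGATE OUTFILE=* /BREAK = HourBin /TotalWeight = SUM(Weight).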

For those interested, the bottleneck is obviously the VARSTOCASES procedure. But I have a substantive reason for going through that step: if one wants to use the original data weighted, say for kernel density maps sliced by time of day, having the data in that format (long, with a field identifying the factor) is more convenient than the wide format. Thinking about it, I could set the 0 weight categories to missing and then drop those cases during the VARSTOCASES, but it remains to be seen whether that will have much of an appreciable effect on real world datasets. Hopefully in the near future I will get the time to provide examples of that (probably in R using faceting and small multiple maps). If anyone has improvements to the code feel free to send them to me (or just shoot me an email).
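
If I do try that, I believe it would only take a small change to the reshape step in the sketch above: recode the zero weights to system-missing and have VARSTOCASES drop them (an untested sketch).

*Drop the zero weight bins during the reshape instead of carrying them along.
RECODE W1 TO W24 (0 = SYSMIS).
VARSTOCASES /MAKE Weight FROM W1 TO W24 /INDEX = HourBin /NULL = DROP.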

In the future I plan on talking about some more visualization techniques to explore crime data with intervals like this. In particular I have a plot manipulating the grammar of graphics a bit to produce a visualization of individual incidents, but it still needs some work and writing up into a nice function. Here is an example though.

[Chart: example visualization of individual incidents with their time intervals]


Citations

Wilson, O.W. 1963. Police Administration. McGraw-Hill.

Ratcliffe, Jerry H. 2002. Aoristic signatures and the spatio-temporal analysis of high volume crime patterns. Journal of Quantitative Criminology 18(1): 23-43. PDF Here.

Some more about black backgrounds for maps

I am at it again discussing black map backgrounds. I make a set of crime maps for several local community groups as part of my job as a crime analyst for Troy PD. I tend to make several maps for each group, separating out violent, property and quality of life related crimes. Within each map I attempt to make a hierarchy between crime types, with more serious crimes as larger markers and less severe crimes as smaller markers.

Despite critiques, I believe the dark background can be useful, as it creates greater contrast for map elements. In particular, the small crime dots are much easier to see (and IMO in these examples the streets and street name labels are still easy to read). Below are examples of the white background, a light grey background, and a black background for the same map (the only changes are that the black point marker is changed to white in the black background map; streets and parks are drawn with a heavy amount of transparency to begin with, so they don't need to be changed).

Surprisingly to me, ink be damned, even printing out the black background looks pretty good! (I need to disseminate paper copies at these meetings.) I think if I had to place the legend on the black map background I would be less thrilled, but currently I have half the page devoted to the map and the other half devoted to a table listing the events and the times they occurred, along with the legend (ditto for the scale bar and the North arrow not looking so nice).

I could probably manipulate the markers to provide more contrast in the white background map (e.g. make them bigger, draw the lighter/smaller symbols with dark outlines, etc.). But I was quite happy with the black background map (and the grey background may be a useful in-between as well). It took no changes besides changing the background in my current template (and changing black circles to white ones) to produce the example maps. I also chose the marker sizes for a reason (so the map did not appear flooded with crime dots, and more severe and less severe crimes were easily distinguished), so I'm hesitant to think I can do much better than what I have so far with the white background maps (and I refuse to put those cheesy crime marker symbols, like a hand gun or a body outline, on my maps).

In terms of differentiating between global and local information in the maps, I believe the high contrast dark background map is nice to identify local points, but does not aid any in identifying general patterns. I don’t think general patterns are a real concern though for the local community groups (displaying so many points on the same map in general isn’t good for distinguishing general patterns anyway).

I'm a bit hesitant to roll out the black maps as of yet (maybe if I get some good feedback on this post I will be more daring). I'm still on the fence, but I may try out the grey background maps for the next round of monthly meetings. I will have to think about whether I can devise a reasonable experiment to differentiate between the maps and whether they meet the community groups' goals and/or expectations. But, all together, the black background maps should certainly be given serious consideration for similar tasks. Again, as I said previously, the high contrast makes smaller elements more obvious (brings them more to the foreground) than the white background does, which as I show here can be useful in some circumstances.

Using the edge element in SPSS inline GPL with arbitrary X,Y coordinates

This post is intended to be a brief illustration of how one can utilize the edge element in SPSS's GGRAPH statements to produce vector flow fields. I post this because the last time I consulted the GGRAPH manual (which admittedly is probably a few versions old) it only had examples using SPSS's automatic graph layouts for edge elements. Here I will show how to specify the algebra for edge elements if you already have X and Y coordinates for the beginning and end points of the edges. Also note that if you want arrow heads on line segments in your plots you need to utilize edge elements (the main original motivation for me to figure this out!).

So here is a simple example of utilizing coordinate begin and end points to specify arrows in a graph, and below the code is the image it produces.

DATA LIST FREE/ X1 X2 Y1 Y2 ID.
BEGIN DATA
1 3 2 4 1
1 3 4 3 2
END DATA.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X1 Y1 X2 Y2 ID MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X1=col(source(s), name("X1"))
  DATA: Y1=col(source(s), name("Y1"))
  DATA: X2=col(source(s), name("X2"))
  DATA: Y2=col(source(s), name("Y2"))
  DATA: ID=col(source(s), name("ID"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"))
  ELEMENT: edge(position((X1*Y1)+(X2*Y2)), shape(shape.arrow))
END GPL.

The magic happens in the call to the edge element, specifically the graph algebra position statement (X1*Y1)+(X2*Y2). I haven't read Wilkinson's Grammar of Graphics, and I will admit that even with SPSS's tutorial on graph algebra at the start of the GPL manual it isn't clear to me why this works. The best answer I can give is that different elements have different styles for specifying coordinates in the graph. For instance, interval elements (e.g. bar charts) can take a location in one dimension and an interval in another dimension in the form X*(Y1+Y2) (see an example in this post of mine on Nabble; I also just found in my notes an example of specifying an edge element in a similar manner to the interval element). This just happens to be a valid form for specifying coordinates in edge elements when you aren't using one of SPSS's automatic graph layout renderings. I guess it is generically of the form FromNode(X*Y)+ToNode(X*Y), but I haven't seen any documentation for this, and all of the examples I had seen in the reference guide utilize a separate set of nodes and edges and then specify a specific type of graph layout.
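
For comparison, here is a small self-contained sketch of the interval style algebra just mentioned, drawing bars that span Y1 to Y2 at each X (made-up data, and I wrap the algebra in region.spread.range as in the range chart earlier in this document, so treat it as an illustration rather than the only valid form).

DATA LIST FREE / X Y1 Y2.
BEGIN DATA
1 2 5
2 1 4
3 3 6
END DATA.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y1 Y2 MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"), unit.category())
  DATA: Y1=col(source(s), name("Y1"))
  DATA: Y2=col(source(s), name("Y2"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"))
  ELEMENT: interval(position(region.spread.range(X*(Y1+Y2))))
END GPL.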

Here is another example visualizing a vector flow field. Eventually I would like to be able to superimpose such elements on a map – but that appears to not yet be possible in SPSS.

set seed = 10.
input program.
loop #i = 0 to 10.
loop #j = 0 to 10.
compute X = #i.
compute Y = #j.
*compute or_deg = RV.UNIFORM(0,360).
end case.
end loop.
end loop.
end file.
end input program.
dataset name sim.
execute.

*making orientation vary directly with X & Y attributes.

compute or_deg = 18*X + 18*Y.

*now to make the edge I figure out the X & Y coordinates with a little distance added (let's say 0.7) based on the orientation.
COMPUTE #pi=4*ARTAN(1).
compute or_rad = (or_deg/180)*#pi.
compute distance = .7.
execute.

compute X2 = X + sin(or_rad)*distance.
compute Y2 = Y + cos(or_rad)*distance.
execute.

DATASET ACTIVATE sim.
* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y X2 Y2 or_deg MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: X2=col(source(s), name("X2"))
  DATA: Y2=col(source(s), name("Y2"))
  DATA: or_deg=col(source(s), name("or_deg"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"))
  ELEMENT: edge(position((X*Y)+(X2*Y2)), shape(shape.arrow))
END GPL.

You can then use other aesthetics in these charts the same as usual (color, size, transparency, etc.).
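
For instance, here is a quick variation on the flow field chart above that maps the orientation degree onto the edge color (this reuses the sim dataset created above; I believe color.interior is the relevant aesthetic for edge elements, but treat this as a sketch to adapt).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y X2 Y2 or_deg MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: X2=col(source(s), name("X2"))
  DATA: Y2=col(source(s), name("Y2"))
  DATA: or_deg=col(source(s), name("or_deg"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"))
  ELEMENT: edge(position((X*Y)+(X2*Y2)), shape(shape.arrow), color.interior(or_deg))
END GPL.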