Why I feel SPSS (or any statistical package) is better than Excel for this particular job

I debated pulling an Andrew Gelman and adding a P.S. to my prior Junk Charts Challenge post, but it ended up being too verbose, so I just made an entirely new follow-up. To start, the discussion has so far evolved through this series of posts;

  • The original post on remaking a great line chart by Kaiser Fung, with the suggestion that the task (data manipulation and graphing) is easier in Excel.
  • My response on how to make the chart in SPSS.
  • Kaiser’s response to my post, in which I doubt I swayed his opinion on using Excel for this task! It appears to me, based on the discussion so far, that the only real quarrel is whether the data manipulation is complicated enough to justify scripting it rather than pointing and clicking in Excel. Recreating Kaiser’s chart in SPSS does take some advanced knowledge of sorting and using lags to identify the pits and recoveries (the same logic could be extended to the data manipulations Kaiser says I skim over, as long as you can numerically or externally define what the start of a recession is).

All things considered for the internet, the discussion has been pretty cordial so far. Although my opinion certainly comes through in the post, I didn’t mean for my post on SPSS to say that the task of grabbing data from online, manipulating it, and creating the graph was in any objective way easier in SPSS than in Excel. I realize pointing and clicking in Excel is easier for most people, and only a few really adept at SPSS (like myself) would consider it easier in SPSS. I write quite a few tutorials on how to do things in SPSS, and that was one of the motivations for the tutorial. I want people using SPSS (or really any graphing software) to make nice graphs – and so if I think I can add value this way to the blogosphere I will! I hope my biggest value added is through the SPSS tutorials, but I try to discuss general graphing concepts in the posts as well, so even for those not using SPSS they hopefully have some other useful content.

My original post wasn’t meant to discuss why I feel SPSS is a better tool for this particular task, although it is certainly a reasonable question to ask (to be frank, I tried to avoid it to prevent flame wars – but it appears I’ve stepped in it at this point). As one of the comments on Kaiser’s follow-up notes (and I agree), some tools are better for some jobs, and we shouldn’t prefer one tool because of some sort of dogmatic allegiance. To make it clear though, and it was part of my motivation to write my initial response to the challenge post, I highly disagree that this particular task, which entails grabbing data from the internet, manipulating it, creating a graph, and updating said graph on a monthly basis, is better done in Excel. For a direct example of my non-allegiance to doing everything in SPSS for this job, I wouldn’t do the grabbing-the-data-from-the-internet part in SPSS (indeed, it isn’t even directly possible unless you use Python code). Assuming it could be fully automated, I would write a custom SPSS job that manipulates the data after a wget command grabs it, and have it all wrapped up in one bat file that runs on a monthly timer.

To go off on a slight tangent, why do I think I’m qualified to make such a distinction? Well, I use both SPSS and Excel on a regular basis. I wouldn’t consider myself a wiz at Excel nor at VBA for Excel, but I have made custom Excel macros in the past to perform various jobs (make and format charts/tables etc.), and I have one task (a custom daily report of the crime incidents reported the previous day) I do on a daily basis at my job in Excel. So, FWIW, I feel reasonably qualified to make decisions on which tasks I should perform in which tools. So I’m giving my opinion, the same way Kaiser gave his initial opinion. I doubt my experience is as illustrious as Kaiser’s, but you can go to my CV page to see my current and prior work roles as an analyst. If I thought Excel, or Access, or R, or Python, or whatever was a better tool I would certainly personally use and suggest it. If you don’t have a little trust in my opinion on such matters, well, you shouldn’t read what I write!

So, again to be clear, I feel this is a job better suited to SPSS (both the data manipulation and creating the graphics), although I admit it is initially harder to write the code to accomplish the task than to point, click, and go through chart wizards in Excel. So here I will try to articulate my reasons.

  • Any task I do on a regular basis, I want to be as automated as possible. Having to point, click, and copy-paste on a regular basis both invites human error and wastes time. I don’t doubt you could fully (or very nearly) automate the task in Excel (as the comment on my blog post mentions). But this will ultimately involve scripting in VBA, which undermines any claim that the Excel solution is easier than the SPSS solution.
  • The breadth of data management capabilities, statistical analyses, and graphics is much larger in SPSS than in Excel. Consider the VBA code necessary to replicate my initial VARSTOCASES command in Excel, that is, reshaping wide data to stacked long form. Or consider the VBA code necessary to compute summary statistics over different groups without knowing what the different groups are beforehand (see the short sketch after this list for how little syntax either takes in SPSS). These are just a sampling of the data management tools that are routine in statistics packages. In terms of charting, the most obvious functionality lacking in Excel is that it currently does not have facilities to make small-multiple charts (you can see some exceptional hacks from Jon Peltier, but those are certainly more limited in functionality than SPSS). Not mentioned (but most obvious) are the statistical capabilities of statistical software!
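
To make that concrete, here is a minimal sketch of those two operations in SPSS syntax. The dataset and variable names (id, group, m1 to m12) are hypothetical stand-ins and not taken from the actual GDP data.

* Hedged sketch with hypothetical names: reshape wide monthly columns to stacked long form.
VARSTOCASES
  /MAKE value FROM m1 TO m12
  /INDEX = month
  /KEEP = id group.

* Summary statistics by group, without needing to enumerate the groups beforehand.
AGGREGATE OUTFILE = * MODE = ADDVARIABLES
  /BREAK = group
  /mean_value = MEAN(value).

Each of these is a single command in SPSS syntax.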

So certainly this particular job could be done in Excel, as it does not require any functionality unique to a stats package. But why hamstring myself with these limitations from the outset? Frequently after I build a custom, routine analysis like this I continually go back and provide more charts, so even if I have a good conceptualization of what I want to do at the outset, there is no guarantee I won’t want to add this functionality later. In terms of charting, not having flexible small-multiple charts is a really big deal; they can be used all the time.

Admittedly, this job is small enough in scope that if, say, the prior analyst was already updating the chart via copy-paste like Kaiser is suggesting, I would consider just keeping that same workflow (re-writing the code in SPSS has an opportunity cost, and the fact that the task is only monthly means it would take quite some time to recover that cost even if the task were fully automated). I just personally have enough experience in SPSS that I know I could script a solution quicker from the outset in SPSS than in Excel (I certainly can’t extrapolate that to anyone else though).

Part of both my preference for and experience in SPSS comes from the jobs I personally have to do. For example, I routinely pull a database of 500,000 incidents, do some data cleaning, merge this to a table of 300,000 charges and offenses, and then merge to a second table of geocoded incident locations. Then using this data I routinely subset it, create aggregate summaries and tables, estimate various statistics and models, make some rudimentary maps, or even export the necessary data to import into GIS software.

For argument’s sake (with the exception of some of the more complicated data cleaning) this could mostly be done in SQL – but certainly no reasonable person should consider doing these multiple table merges and data cleaning in Excel (the nice interactive facilities for working with a spreadsheet in Excel are greatly diminished with any table that takes more than a few scrolls to see). Statistical packages are really much more than tools to fit models; they are tools for working with and manipulating data. If you have to conduct routine tasks in which you manipulate data (something I assume most analysts have to do), I would highly recommend learning a statistical software package, the same way I would recommend getting to know SQL.

To be more balanced, here are things (knowing SPSS really well and Excel not as thoroughly) I think Excel excels at compared to SPSS;

  • Ease of making nicely formatted tables
  • Ease of directly interacting and editing components of charts and tables (this includes adding in supplementary vector graphics and labels).
  • Sparklines
  • Interactive Dashboards/Pivot Tables

Routine data management is not one of them, and really only sparklines and interactive dashboards are functionality for which I would prefer to make an end product in Excel over SPSS (and that doesn’t mean the whole workflow needs to be in one software). I clean up ad-hoc tables for distribution in Excel all the time, because (as I said above) editing them in Excel is easier than editing them in SPSS. Again, my opinion, FWIW.

Update for Aoristic Macro in SPSS

I’ve substantially updated the aoristic macro for SPSS from what I previously posted. The updated code can be found here. The improvements are;

  • The code is much more modularized; it is now only one function and takes an Interval parameter to determine which interval summaries you want.
  • It includes Agresti-Coull binomial error intervals (95% confidence intervals). It also returns a percentage estimate and the total number of cases the estimate is based on, besides the usual info for time period, split file, and the absolute aoristic estimate.
  • It allows an optional argument to save the reshaped long dataset.

Functionality dropped includes the default plots and the saving of begin, end, and middle times for the same estimates. I just didn’t find these useful (besides for academic purposes). A hypothetical example call is sketched below.
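
For reference, a call to the updated macro might look something like the line below. This is only a hypothetical illustration of the interface described above; the actual macro and token names in the posted code may differ.

* Hypothetical call - check the posted code for the exact macro and token names.
!aoristic begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2 Interval = 60.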

The main motivation was to add in error bars, as I found when I was making many of these charts it was obvious that some of the estimates were highly variable. While the Agresti-Coull binomial proportions are not entirely justified in this novel circumstance, they are better than nothing to at least illustrate the error in the estimates (it seems to me that they will likely be too small if anything, but I’m not sure).
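
For those curious, the Agresti-Coull interval itself is simple to compute. Below is a minimal sketch of the textbook 95% formula in SPSS syntax (not necessarily the macro's internals); n_hits and n_total are hypothetical names standing in for the (possibly weighted) count in a bin and the number of cases the estimate is based on.

* Hedged sketch of a 95% Agresti-Coull interval; n_hits and n_total are hypothetical variables.
compute #z = 1.96.
compute #n_adj = n_total + #z**2.
compute #p_adj = (n_hits + (#z**2)/2)/#n_adj.
compute ci_low = #p_adj - #z*SQRT(#p_adj*(1 - #p_adj)/#n_adj).
compute ci_high = #p_adj + #z*SQRT(#p_adj*(1 - #p_adj)/#n_adj).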

I think a good paper I might work on in the future, when I get a chance, is to 1) show how variable the estimates are in small samples, and 2) evaluate the asymptotic coverage of various estimators (traditional binomial proportions vs. the bootstrap I suppose). Below is an example output of the updated macro, again with the same data I used previously. I make the small multiple chart by different crime types to show the variability in the estimates for given sample sizes.

Some random SPSS graph tips: shading areas under curves and using dodging in binned dot plots

This is just a quick post on some random graphing examples you can do with SPSS through inline GPL statements, but are not possible through the GUI dialogs. These also take knowing a little bit about the grammar of graphics and the nuts and bolts of SPSS’s implementation. First up, shading under a curve.

Shading under a curve

I assume the motivation for doing this is obvious, but it takes a little advanced GPL to figure out how to accomplish it. I swore someone asked how to do this the other day on NABBLE, but I could not find any such question. Below is an example.

*****************************************.
input program.
loop #i = 1 to 2000.
compute X = (#i - 1000)/250.
compute PDF = PDF.NORMAL(X,0,1).
compute CDF = CDF.NORMAL(X,0,1).
end case.
end loop.
end file.
end input program.
dataset name sim.
exe.

formats PDF X (F2.1).

*area under entire curve.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PDF MISSING=LISTWISE
  REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: X=col(source(s), name("X"))
 DATA: PDF=col(source(s), name("PDF"))
 GUIDE: axis(dim(1), label("X"))
 GUIDE: axis(dim(2), label("Prob. Dens."))
 ELEMENT: area(position(X*PDF), missing.wings())
END GPL.


*Mark off different areas.
compute tails = 0.
if CDF <= .025 tails = 1.
if CDF >= .975 tails = 2.
exe.

*Area with particular locations highlighted.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PDF tails 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: X=col(source(s), name("X"))
 DATA: PDF=col(source(s), name("PDF"))
 DATA: tails=col(source(s), name("tails"), unit.category())
 SCALE: cat(aesthetic(aesthetic.color.interior), map(("0",color.white),("1",color.grey),("2",color.grey)))
 SCALE: cat(aesthetic(aesthetic.transparency.interior), map(("0",transparency."1"),("1",transparency."0"),("2",transparency."0")))
 GUIDE: axis(dim(1), label("X"))
 GUIDE: axis(dim(2), label("Prob. Dens."))
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 GUIDE: legend(aesthetic(aesthetic.transparency.interior), null())
 ELEMENT: area(position(X*PDF), color.interior(tails), transparency.interior(tails))
END GPL.
*****************************************.

The area under the entire curve is pretty simple code, and can be accomplished through the GUI. The shading under different sections, though, requires a bit more thought. If you want both the upper and lower tails of the PDF colored, you need to specify separate categories for them; otherwise they will connect at the bottom of the graph. Then you need to map the categories to specific colors, and if you want to be able to see the gridlines behind the central area you need to make the center area transparent. Note I also omit the legend, as I assume it will be obvious what the graph represents given other context or textual summaries.

Binning scale axis to produce dodging

The second example is based on the fact that for SPSS to utilize the dodge collision modifier, one needs a categorical axis. What if you really want the axis to be continuous though? You can make the data categorical but keep the axis on a continuous scale by specifying a binned scale, and just make the binning small enough to suit your actual data values. This is easy to show with a categorical dot plot. If you can, IMO it is better to use dodging than jittering, and below is a perfect example. If you run the first GGRAPH statement, you will see the points aren’t dodged, although the graph is generated just fine and dandy with no error messages. The second graph bins the X variable (which is on the second dimension) with intervals of width 1. This ends up being exactly the same as the continuous axis, because the values are all positive integers anyway.

*****************************************.
set seed = 10.
input program.
loop #i = 1 to 1001.
compute X = TRUNC(RV.UNIFORM(0,101)).
compute cat = TRUNC(RV.UNIFORM(1,4)).
end case.
end loop.
end file.
end input program.
dataset name sim.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X cat
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: X=col(source(s), name("X"))
 DATA: cat=col(source(s), name("cat"), unit.category())
 COORD: rect(dim(1,2))
 GUIDE: axis(dim(1), label("cat"))
 ELEMENT: point.dodge.symmetric(position(cat*X))
END GPL.

*Now lets try to bin X so the points actually dodge!.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X cat
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: X=col(source(s), name("X"))
 DATA: cat=col(source(s), name("cat"), unit.category())
 COORD: rect(dim(1,2))
 GUIDE: axis(dim(1), label("cat"))
 ELEMENT: point.dodge.symmetric(position(bin.rect(cat*X, dim(2), binWidth(1))))
END GPL.
****************************************.

Both examples shown here only take slight alterations to code that can be generated through the GUI, but take a bit more understanding of the grammar to know how to accomplish (or even to know they are possible). You unfortunately can’t implement Wilkinson’s (1999) true dot plot technique like this (he doesn’t suggest binning, but rather choosing where the dots are placed via kernel density estimation). But this should be sufficient for most circumstances.

Interval graph for viz. temporal overlap in crime events

I’ve made a visualization intended to be an exploratory tool to identify overlapping criminal events over a short period. Code to reproduce the macro is here, and it includes an example with made up data. As opposed to aoristic analysis, which lets you see globally the aggregated summaries of when crime events occurred, potentially correcting for the unspecified time period, this graphical procedure allows you to identify whether local events overlap. It also allows one to perceive global information, in particular whether the uncertainty of events occurs in the morning, afternoon, or night.

A call to the macro ends up looking like this (other_vars is optional and can include multiple variables – I assume the token names are self-explanatory);

!interval_data date_begin = datet_begin time_begin = XTIMEBEGIN date_end = datet_end time_end = XTIMEEND 
label_id = myid other_vars = crime rand.

This just produces a new dataset named interval_data, which can then be plotted. And here is the example graph that comes with the macro and its fake data (after I edited the chart slightly).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" dataset = "interval_data" VARIABLES= LABELID DAY TIMEBEGIN TIMEEND MID WITHINCAT DAYWEEK
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: LABELID=col(source(s), name("LABELID"), unit.category())
 DATA: DAY=col(source(s), name("DAY"), unit.category())
 DATA: TIMEBEGIN=col(source(s), name("TIMEBEGIN"))
 DATA: TIMEEND=col(source(s), name("TIMEEND"))
 DATA: MID=col(source(s),name("MID"))
 DATA: WITHINCAT=col(source(s),name("WITHINCAT"), unit.category())
 DATA: DAYWEEK=col(source(s),name("DAYWEEK"), unit.category())
 COORD: rect(dim(1,2), cluster(3,0))
 SCALE: cat(dim(3), values("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23",
                                         "24","25","26","27","28","29","30","31"))
 ELEMENT: interval(position(region.spread.range(WITHINCAT*(TIMEBEGIN + TIMEEND)*DAY)), color.interior(DAYWEEK),
                             transparency.interior(transparency."0.1"), transparency.exterior(transparency."1"))
 ELEMENT: point(position(WITHINCAT*MID*DAY), color.interior(DAYWEEK), label(LABELID))
END GPL.

The chart can be interpreted as follows: the days of the week are colored, and each interval represents one crime event, except when a crime event occurs overnight, in which case the bar is split over two days (and the event is labeled on each day). I wanted labels in the chart to easily reference specific events, and I assigned a point at the midpoint of each interval to plot the labels (and also to give some visual girth to events that occurred over a short interval – otherwise they would be invisible in the chart). To displace the bars horizontally within the same day, the chart essentially uses the same type of procedure that occurs in clustered bar charts.

GPL code can be inserted directly within macros, but it is quite a pain. It is better to use Python to parameterize GGRAPH, but I’m too lazy and don’t have Python installed on my machine at work (the ancient version of SPSS, V15, is to blame).

Here is another example with more of my data in the wild. This is for thefts from motor vehicles in Troy from the beginning of the year until 2/22. We had a bit of a rash over that time period, but it has since died down after the arrest of one particularly prolific offender. This is evident in the chart.

We can also break the chart down by other categories. This is what the token other_vars is for; it carries forward these other variables for use in faceting. For example, Troy has 4 police zones, and here is the graph broken down by each of them. Obviously crime sprees within short time frames are more likely perpetrated in close proximity, and events committed by the same person are likely to re-occur within the same geographic proximity. The individual noted before was linked to events in Zone 4. I turn the labels off (it is pretty easy to toggle them in SPSS), and then one can pretty easily focus on either individual events close in time or overlapping intervals.

*Panelling by BEAT_DIST.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" dataset = "interval_data" VARIABLES= LABELID DAY TIMEBEGIN TIMEEND MID WITHINCAT DAYWEEK BEAT_DIST
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: LABELID=col(source(s), name("LABELID"), unit.category())
 DATA: DAY=col(source(s), name("DAY"), unit.category())
 DATA: TIMEBEGIN=col(source(s), name("TIMEBEGIN"))
 DATA: TIMEEND=col(source(s), name("TIMEEND"))
 DATA: MID=col(source(s),name("MID"))
 DATA: WITHINCAT=col(source(s),name("WITHINCAT"), unit.category())
 DATA: DAYWEEK=col(source(s),name("DAYWEEK"), unit.category())
 DATA: BEAT_DIST=col(source(s),name("BEAT_DIST"), unit.category())
 COORD: rect(dim(1,2), cluster(3,0))
 SCALE: cat(dim(3), values("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23",
                           "24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45",
                           "46"))
 ELEMENT: interval(position(region.spread.range(WITHINCAT*(TIMEBEGIN + TIMEEND)*DAY*1*BEAT_DIST)), color.interior(DAYWEEK),
                             transparency.interior(transparency."0.1"), transparency.exterior(transparency."1"))
 ELEMENT: point(position(WITHINCAT*MID*DAY*1*BEAT_DIST), color.interior(DAYWEEK), label(LABELID))
END GPL.

Note that to make the X axis include all of the days I needed to list all of the numerical categories (1 through 46) in the values statement of the SCALE statement.

The chart in its current form can potentially be improved in a few ways, but I’ve had trouble accomplishing them so far. One is that instead of utilizing clustering to displace the intervals, one could use dodging directly. I have yet to figure out how to specify the appropriate GPL code when using dodging instead of clustering, though. Another is to utilize the dates directly, instead of an arbitrary categorical counter since the beginning of the series. To use dodging or clustering the axis needs to be categorical, so you could just make the axis the date, but bin the dates and specify the width of the bin categories to be 1 day (this would also avoid the annoying values statement to specify all the days). Again though, I was not able to figure out the correct GPL code to accomplish this.

I’d like to investigate ways to make this interactive and link with maps as well. I suspect it is possible in D3.js, and if I get around to figuring out how to make such a map/graphic duo I will for sure post it on the blog. In the meantime any thoughts or comments are always appreciated.

Using sequential case processing for data management in SPSS

When making calculations, SPSS essentially loops through every case sequentially. So although calculations in syntax are always vectorized (the exception being explicit loops in MATRIX commands), that is, compute y = x - 5. works over the entire x vector without specifying a loop, under the hood it is just looping through all of the records in the data set and calculating the value of y one row at a time.

We can use this to our advantage in a variety of data management tasks, in conjunction with lagged values in the data matrix. Let’s consider making a counter variable within a set of IDs, using the example dataset below;

data list free /id value.
begin data
1 10
1 11
1 14
1 13
2 12
2 90
2 16
2 14
3 12
3 8
3 17
3 22
end data.
dataset name seq.

To make a counter for the entire dataset, it would be as simple as using the system variable $casenum, but what about a counter variable within each unique id value? Well, we can use SPSS’s sequential case processing and LAG to do that for us. For example (note that this assumes the data are already sorted so the id values are in sequential order in the dataset);

DO IF id <> LAG(id) or MISSING(LAG(id)) = 1.
    COMPUTE counter_id = 1.
ELSE IF id = LAG(id).
    COMPUTE counter_id = lag(counter_id) + 1.
END IF.

The first if statement evaluates whether the previous id value is the same, and if it is different (or missing, which happens for the first row in the dataset) it starts the counter at 1. If the lagged id value is the same, it increases the counter by 1. It should be clear how this can be used to identify duplicate values as well. Although the MATCH FILES command can frequently be more economical, it is pretty easy using sort. For instance, let’s say in the previous example I wanted to have only one id per row in the dataset (i.e. eliminate duplicate ids), but keep the highest value within each id. This can be done just by sorting the dataset in a particular way (so the id with the highest value is always at the top of the list of sequential ids).

SORT CASES BY id (A) value (D).
COMPUTE dup = 0.
IF id = lag(id) dup = 1.
SELECT IF dup = 0.

The equivalent expression using match files would be (note the reversal of dup in the two approaches; with MATCH FILES I want to select the value of 1).

SORT CASES BY id (A) value (D).
MATCH FILES file = *
/first = dup
/by id.
SELECT IF dup = 1.

The match files approach scales better to more variables. If I had two key variables I would need to write IF id1 = lag(id1) and id2 = lag(id2) dup = 1. with the lag approach, but only need to write /by id1 id2. for the match files approach (a sketch of the two-key case follows the AGGREGATE example below). Again this particular example can be trivially done with another command (AGGREGATE in this instance), but the main difference is that the two approaches above keep all of the variables in the current data set, whereas with AGGREGATE the variables to keep need to be explicitly written on the command.

DATASET ACTIVATE seq.
DATASET DECLARE agg_seq.
AGGREGATE
  /OUTFILE='agg_seq'
  /BREAK=id
  /value=MAX(value).
dataset close seq.
dataset activate agg_seq.
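
To make the two-key comparison above concrete, here is a hypothetical sketch (id1 and id2 are made-up key variables); run one approach or the other, not both in sequence.

*Lag approach - the IF statement grows with each additional key.
SORT CASES BY id1 (A) id2 (A) value (D).
COMPUTE dup = 0.
IF id1 = lag(id1) and id2 = lag(id2) dup = 1.
SELECT IF dup = 0.

*MATCH FILES approach - only the /by line grows.
SORT CASES BY id1 (A) id2 (A) value (D).
MATCH FILES file = *
  /first = dup
  /by id1 id2.
SELECT IF dup = 1.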

One may think that sequential case processing is not that helpful, as I’ve shown some alternative ways to do the same thing. But consider a case where you want to propagate values down; this can’t be done directly via MATCH FILES or AGGREGATE. For instance, I’ve done some text munging of tables exported from PDF files that look approximately like this when read into an SPSS data file (where I use periods to symbolize missing data);

data list free /table_ID (F1.0) row_name col1 col2 (3A5).
begin data
1 . Col1 Col2 
. Row1 4 6
. Row2 8 20
2 . Col1 Col2
. Row1 5 10
. Row2 15 20
end data.
dataset name tables.

Any more useful representation of the data would need to associate particular rows with the table they came from. Here is sequential case processing to the rescue;

if MISSING(table_ID) = 1 table_ID = lag(table_ID).

Very simple fix, but perhaps not intuitive without munging around in SPSS for a while. For another simple application of this, see this NABBLE discussion (where I give an example of propagating down and concatenating multiple string values). Another (more elaborate) example of this can be seen when merging and sorting a database of ranges to a number within the range.
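
Here is a rough sketch of that string example (not the exact code from the NABBLE discussion); it assumes the data are sorted by id, that text is a string variable, and that the declared width of all_text is wide enough to hold the concatenated result.

string all_text (A100).
DO IF id <> lag(id) or MISSING(lag(id)) = 1.
    COMPUTE all_text = text.
ELSE.
    COMPUTE all_text = CONCAT(RTRIM(lag(all_text)), ";", text).
END IF.

The last case within each id then holds the full concatenated string, which can be grabbed with AGGREGATE (using the LAST function) or the sorting tricks above.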

This is what I would consider an advanced data management tool, and one that I use on a regular basis.

Aoristic analysis with SPSS

I’ve written a macro to conduct aoristic analysis with SPSS. Here I will briefly describe what it is, provide alternative references and demonstrate some of its utility on example data from Arlington PD.

In short, crime event data are frequently recorded as occurring within some indefinite time frame. For example, you may park your car and go to work at 08:00, and when you come back out at 16:30 the same day you find your car window broken and your GPS stolen. Unless there happen to be other witnesses to the crime, you don’t know when the criminal event occurred besides between those two times. This is problematic for crime analysis because you want to be able to look at the distribution of when events occurred, so as to give suggestions for why the events are occurring and how to potentially address them. Allocating patrols geographically and temporally to areas of high crime incidence has been regular practice for a long time (Wilson, 1963)! Aoristic analysis is simply a means to take that uncertainty of when the event occurred into account when we examine the overall incidence of crimes across a set of times.

For a very brief illustrative example, let’s say we want to know the number of crimes occurring within the 08:00, 09:00, and 10:00 hours. If we had a criminal event that potentially occurred between 08:00 and 10:00, which is a total time span of 120 minutes (2 hours), instead of counting that event as occurring at 08:00 (the begin time), 10:00 (the end time), or 09:00 (the middle time), we spread the event out over the time frame and only partially count it within any particular interval. So here it would count as a weight of 0.50 in both the 08:00 and 09:00 categories (60/120 = 0.50) and a weight of 0 in the 10:00 category (note the weights sum back to the value of 1). This just ends up being a way to estimate the incidence of some event within a given time bin knowing that the event did not necessarily occur in that time bin, so it only partially counts towards the total in that bin (where partially is defined by how long the interval is and how much of that interval overlaps with the bin).
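
In code, the core weight calculation is nothing more than the overlap divided by the interval length. Below is a minimal sketch for a single bin, with hypothetical variables holding the event interval and bin boundaries expressed in minutes of the day; the actual macro loops this kind of calculation over every bin.

* Hedged sketch: weight for one bin = minutes of overlap / total interval length (hypothetical names).
compute bin_weight = MAX(0, MIN(int_end, bin_end) - MAX(int_begin, bin_begin)) / (int_end - int_begin).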

Here I illustrate the macro with some examples from the Arlington PD data downloaded on 1/11/2013. This is the only dataset I’ve found publicly available online that has both start and end dates for events (I first looked at NIBRS, and was slightly surprised that it did not have this information). I have posted the macro code, with examples therein using fake data and the same Arlington PD data, and compare the results to this online calculator.

Note, my results will be slightly different than most other programs (including the online app I pointed to) because of one arbitrary (but what I feel is reasonable) coding decision. When an event interval is longer than the time period evaluated in the particular estimate (either a day or a week for my functions), it simply returns the event coded as having equal weight across the entire time period. Others don’t do this as far as I’m aware. So say an event takes place between 08:00 on 1/2/2013 and 10:00 on 1/3/2013. For my functions that only evaluate times over the day, I would return equal probability within every time slot, although some calculators would say there is a higher probability of the event occurring between 08:00 and 10:00 (because of the wrap-around). I believe this practice is a bit of a stretch; any uncertainty of over a day is essentially saying the information is useless for determining the time of day at all (although my week functions would be equivalent in that example). In those cases the begin and end times say more about when people check their cars, wake up in the morning, get home from work, come back from vacation, etc. than they do about when the actual crime occurred.

Some examples

If you want to follow along right within SPSS, I suggest going to the google code site where I’ve posted the code and data, but otherwise you can just take my word for it and see how the macro works in practice. I provide several separate functions to estimate the frequency of crimes occurring during 1 hour bins over the day, 15 minute bins over the day, days over the week, 1 hour bins over the week, or 15 minute bins over the week.

Here is an example call of the macro and the output from the 1 hour bins over the day with all crimes for the Arlington data.

!aoristic_day1hour begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2.

Date1 and Date2 are the begin and end dates respectively, and Time1 and Time2 are the begin and end times respectively. Below is (close to) the automated graph the macro produces, which is just a line chart superimposing both the aoristic estimate and what the estimate would be using the begin, end, or middle time. The only differences are that I post-hoc edited the aoristic estimate line to be thicker and in the front (so it is more prominent), plus my personal chart template. Parameterizing GGRAPH charts to work in macros is quite annoying, and Python is a preferable solution. I’m personally happy with just a helper function that returns the data in a nice format for me to generate the appropriate GGRAPH code myself; there is more power in knowing the grammar than in being complacent with the default chart.

[Figure: aoristic estimate versus the begin, middle, and end time estimates, 1 hour bins over the day]

So you can see here that overall, the aoristic estimate does not make much of a difference. Here is the same info for the 15 minute bins over the day.

!aoristic_day15min begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2.

[Figure: aoristic estimate versus the begin, middle, and end time estimates, 15 minute bins over the day]

You can see here the aoristic estimate smooths out the data quite a bit more (which is nice above and beyond just worrying about whether one approach is correct or not). With the smaller time bins you can also see patterns of over-reporting incidents at natural times on the hour and half-hour. You can also see midnight, noon, and 08:00 are oddly popular times to report either the beginning or end times of incidents. You can spot a few others as well that differ between begin and end times; for instance, it appears 07:00 is a popular end time but not so popular a begin time. The obverse is true for events in the middle afternoon, late evening, and early night (e.g. the big spikes in the green begin time line for hours between 17:00 and midnight). Also note that it is near universal that crime dips to its lowest around 4~5 am, and you can see that using either the aoristic estimate or the middle point of the event brings the number of events up during this period (as expected).

Also in the set of functions I have the capability to specify an arbitrary category to split the results by, and here is an example splitting the day of week aoristic estimate by the beat variable provided with the Arlington data. Again this isn’t the chart directly produced by the macro, but a subsequent GGRAPH command to produce a nicer chart (the original is ok if you make it much bigger, but with so many categories facet wrapping is appropriate to save space).

!aoristic_week begin_date = Date1 begin_time = Time1 end_date = Date2 end_time = Time2 split = Beat.

[Figure: day of week aoristic estimates, facet wrapped by Arlington police beat]

The main thing that draws attention in this graph is the difference in levels of calls and different trends between beats (there were no obvious differences in the aoristic estimates versus the naive estimates, which is unsurprising since most incidents don’t have uncertainty of over a day). I know nothing about Arlington, and I don’t know where these beats are, so I can’t say anything about why these differences may potentially occur. In SPSS days start at Sunday (so Sunday = 1 and Saturday = 7). It isn’t that strange to expect slightly more crime on Fridays and Saturdays (people out and about doing things that make them more vulnerable), but for most of the beats showing a flat profile this is not strange either. But it is interesting to see Beat 260 have an atypical pattern of obviously more crimes during the week, and if I had to hazard a guess I would assume there is a middle or high school in Beat 260.

Although you could argue aoristic analysis is called for purely on theoretical grounds, all of these examples show events for which it doesn’t make much of a difference whether you use it or simpler methods. Where it is likely to make the most difference, though, are events which have the longest unknown time intervals. Property crimes tend to be committed when the victim is not around, and so here I compare the aoristic estimate for 15 minute intervals over the day for burglaries, which shows an example where the aoristic estimate makes an actual difference!

[Figure: burglary aoristic estimate versus the begin, middle, and end time estimates, 15 minute bins over the day]

One can see the property crimes have a larger difference for the aoristic estimate across the day, and it is largely flat compared to the begin and end times. The end and begin times are likely biased towards reporting when people discovered that the victimization occurred or when they last left their home vulnerable. There is some slight trend for more burglaries to occur during the daytime, and somewhat higher periods during the night (with lulls around 08:00 and 18:00). These are near the exact opposite conclusions you would make utilizing the begin and end times as to when most burglaries occurred! Middle times result in some weird differences as well, with a high spike in the 01:00 to 04:00 range.

Some closing remarks

This project kicked my butt a little, and took much longer than I expected. Certainly the code could use improvement and re-factoring, but I’m glad it is done (and seemingly working). You will see there is certainly a lot of redundancy between the functions, some temporary variables are computed multiple times, and the week-long functions take a while to compute.

In the SPSS macro, what I do in a nutshell is make a variable for every time bin, calculate the weight each case has for that time bin, then reshape the dataset from wide to long, and at last aggregate the total weight within each time bin. Note this results in many more cases than the original data. For example, the Arlington data displayed above has slightly over 49,000 incidents. For the 15 minute intervals per day (96), this results in over 4.7 million cases (49,000*96). For the 1 hour bins across the week, this results in n times 168 more cases, and for the 15 minute bins across the week n times 672! Subsequently those latter two take an appreciable amount of time to compute for larger datasets (if you don’t run out of local memory on your computer entirely, which I’m guessing could easily happen for some older systems when you have upwards of probably 60,000 cases).

For those interested, the bottleneck is obviously the VARSTOCASES procedure. But I have a substantive reason for going through that step: if one wants to use the original data weighted, say for kernel density maps sliced by time of day, having the data in that format (long, with a field identifying the factor) is more convenient than the wide format. Thinking about it, I could generate NULL data for the 0 weight categories and then drop those cases during the VARSTOCASES, but it remains to be seen if that will have much of an appreciable effect on real world datasets. Hopefully in the near future I will get the time to provide examples of that (probably in R using faceting and small multiple maps). If anyone has improvements to the code feel free to send them to me (or just shoot me an email).
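
That last idea would only take a couple of lines. Below is a hedged sketch, where w1 to w96 are hypothetical stand-ins for the per-bin weight variables; recoding zero weights to system-missing lets VARSTOCASES drop those rows via its /NULL behavior.

* Hedged sketch: drop zero-weight bins during the reshape (w1 to w96 are hypothetical names).
RECODE w1 TO w96 (0 = SYSMIS).
VARSTOCASES
  /MAKE weight FROM w1 TO w96
  /INDEX = timebin
  /NULL = DROP.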

In the future I plan on talking about some more visualization techniques to explore crime data with intervals like this. In particular I have a plot manipulating the grammar of graphics a bit to produce a visualization of individual incidents, but it still needs some work and writing up into a nice function. Here is an example though.

[Figure: preview of the interval plot of individual incidents]

Citations

Wilson, O.W. 1963. Police Administration. McGraw-Hill.

Ratcliffe, Jerry H. 2002. Aoristic signatures and the spatio-temporal analysis of high volume crime patterns. Journal of Quantitative Criminology 18(1): 23-43. PDF Here.

Using the edge element in SPSS inline GPL with arbitrary X,Y coordinates

This post is intended to be a brief illustration of how one can utilize the edge element in SPSS’s GGRAPH statements to produce vector flow fields. I post this because the last time I consulted the GGRAPH manual (which admittedly is probably a few versions old), it only had examples of SPSS’s ability to utilize random graph layouts for edge elements. Here I will show how to specify the algebra for edge elements if you already have X and Y coordinates for the beginning and end points of the edges. Also note that if you want arrowheads for line elements in your plots you need to utilize edge elements (the main original motivation for me to figure this out!).

So here is a simple example of utilizing coordinate begin and end points to specify arrows in a graph, and below the code is the image it produces.

DATA LIST FREE/ X1 X2 Y1 Y2 ID.
BEGIN DATA
1 3 2 4 1
1 3 4 3 2
END DATA.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X1 Y1 X2 Y2 ID MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X1=col(source(s), name("X1"))
  DATA: Y1=col(source(s), name("Y1"))
  DATA: X2=col(source(s), name("X2"))
  DATA: Y2=col(source(s), name("Y2"))
  DATA: ID=col(source(s), name("ID"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"))
  ELEMENT: edge(position((X1*Y1)+(X2*Y2)), shape(shape.arrow))
END GPL.

The magic happens in the call to the edge element, specifically the graph algebra position statement of (X1*Y1)+(X2*Y2). I haven’t read Wilkinson’s Grammar of Graphics, and I will admit that even with SPSS’s tutorial on graph algebra at the intro of the GPL manual it isn’t clear to me why this works. The best answer I can give is that different elements have different ways to specify coordinates in the graph. For instance, interval elements (e.g. bar charts) can take a location in one dimension and an interval in another dimension in the form of X*(Y1+Y2) (see an example in this post of mine on Nabble – I also just found in my notes an example of specifying an edge element in a similar manner to the interval element). This just happens to be a valid form to specify coordinates for edge elements when you aren’t using one of SPSS’s automatic graph layout renderings. I guess it is generically of the form FromNode(X*Y)+ToNode(X*Y), but I haven’t seen any documentation for this, and all of the examples I have seen in the reference guide utilize separate node and edge datasets and then specify a specific type of graph layout.

Here is another example visualizing a vector flow field. Eventually I would like to be able to superimpose such elements on a map – but that appears to not yet be possible in SPSS.

set seed = 10.
input program.
loop #i = 0 to 10.
loop #j = 0 to 10.
compute X = #i.
compute Y = #j.
*compute or_deg = RV.UNIFORM(0,360).
end case.
end loop.
end loop.
end file.
end input program.
dataset name sim.
execute.

*making orientation vary directly with X & Y attributes.

compute or_deg = 18*X + 18*Y.

*now to make the edge I would figure out the X & Y coordinate with a little distance added (set to .7 below) based on the orientation.
COMPUTE #pi=4*ARTAN(1).
compute or_rad = (or_deg/180)*#pi.
compute distance = .7.
execute.

compute X2 = X + sin(or_rad)*distance.
compute Y2 = Y + cos(or_rad)*distance.
execute.

DATASET ACTIVATE sim.
* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y X2 Y2 or_deg MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: X2=col(source(s), name("X2"))
  DATA: Y2=col(source(s), name("Y2"))
  DATA: or_deg=col(source(s), name("or_deg"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"))
  ELEMENT: edge(position((X*Y)+(X2*Y2)), shape(shape.arrow))
END GPL.

You can then use other aesthetics in these charts the same as usual (color, size, transparency, etc.).

Using Bezier curves to draw flow lines

As I talked about previously, great circle lines are an effective way to visualize flow lines, as the bending of the arcs creates displacement among over-plotted lines. A frequent question that comes up though (see an example on GIS.stackexchange and on the flowing data forums) is that great circle lines don’t provide enough bend over short distances. Of course for visualizing journey to crime data (one of the topics I am interested in), one has the problem that most known journeys are within one particular jurisdiction or otherwise over short distances.

In the GIS question I linked to above I suggested utilizing half circles, although that seemed like overkill. I have currently settled on drawing an arcing line utilizing quadratic Bezier curves. For a thorough demonstration of Bezier curves, how to calculate them, and to see one of the coolest interactive websites I have ever come across, check out A primer on Bezier curves – by Mike "Pomax" Kamermans. These are flexible enough to produce any desired amount of bend (and are simple enough for me to be able to program!). Also I think they are more aesthetically pleasing than irregular flows. I’ve seen some programs use hook-like bends (see an example of this flow mapping software from the Spatial Data Mining and Visual Analytics Lab), but I’m not all that fond of that, for either aesthetic reasons or for aiding the visualization.

I won’t go into too much detail here on how to calculate them (you can see the formulas for the quadratic curves on the Mike Kamermans site I referenced), but you basically 1) define where the control point is located (the origin and destination are already defined), and 2) interpolate an arbitrary number of points along the curve. My SPSS macro is set to 100, but this can be made either bigger or smaller (or conditional on other factors as well).
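
For anyone who wants to roll their own, here is a minimal sketch of the interpolation in SPSS syntax rather than the full macro. The variable names are hypothetical: OX,OY are the origin, DX,DY the destination, and CX,CY the control point.

* Hedged sketch of quadratic Bezier interpolation, not the posted macro itself.
data list free / OX OY DX DY CX CY.
begin data
0 0 10 0 5 4
end data.
dataset name flow.

* 101 interpolated points per origin-destination pair.
vector BX(101) BY(101).
loop #i = 1 to 101.
compute #t = (#i - 1)/100.
compute BX(#i) = (1 - #t)**2*OX + 2*(1 - #t)*#t*CX + #t**2*DX.
compute BY(#i) = (1 - #t)**2*OY + 2*(1 - #t)*#t*CY + #t**2*DY.
end loop.
execute.

* Reshape to one row per point, in drawing order along the curve.
varstocases /make BX from BX1 to BX101 /make BY from BY1 to BY101 /index = order.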

Below is an example diagram I produced to demonstrate quadratic Bezier curves. For my application, I suggest placing the control point perpendicular to the midpoint between the origin and destination. This creates a regular arc between the two locations, and one can control which way the arc bends by the direction of the offset. In the SPSS function provided, the user then supplies the ratio of the height of the control point to the distance between the origin and destination locations (so points further away from each other will be given higher arcs). Below is a diagram made using LaTeX and the TikZ library (which has a handy function to calculate Bezier curves).

Here is a simpler demonstration of controlling the direction and adjusting the control point to produce either a flatter arc or an arc with more eccentricity.

Here is an example displaying 200 JTC lines from the simulated data that comes with the CrimeStat program. The first image shows the original straight lines, and the second image the curved lines using a control point at a height of half the distance between the origin and destination coordinates.

Of course, both are most definitely still quite crowded, but what do people think? Is my curved lines suggestion beneficial in this example?

Here I have provided the SPSS function (and some example data) used to calculate the lines (I then use the ET Geowizards add-on to turn the points into lines in ArcGIS). Perhaps in the future I will work on an R function to calculate Bezier curves (I’m sure they could be of some use), but hopefully for those interested this is simple enough to program your own function in whatever language of interest. I have the starting of a working paper on visualizing flow lines, and I plan on this being basically my only unique contribution (everything else is just a review of what other people have done!)

One could be more fancy as well, and make the curves differ based on other factors. For instance, make the control point closer to either the origin or destination if the flow amount is asymmetrical, or make the control point further away (and subsequently the arc larger) if the flow is more voluminous. Ideas for the future I suppose.

Visualization techniques for large N scatterplots in SPSS

When you have a large N scatterplot, you frequently have dramatic over-plotting that prevents effectively presenting the relationship. Here I will give a few quick examples of simple ways to alter the typical default scatterplot to ease the presentation. I give examples in SPSS, although I suspect any statistical package contains these options to alter the default scatterplot. At the end of the post I will link to the SPSS code and data I used for these examples. For a brief background on the data, these are UCR index crime rates for rural counties by year in Appalachia from 1977 to 1996. The data are taken from the dataset Spatial Analysis of Crime in Appalachia, 1977-1996 posted on ICPSR (doi:10.3886/ICPSR03260.v1). While these scatterplots ignore the time dimension of the dataset, they are sufficient to demonstrate techniques to visualize big N scatterplots, as they result in over 7,000 county-years to visualize.

So what is the problem with typical scatterplots for such large data? Below is an example default scatterplot in SPSS, plotting the Burglary Rate per 100,000 on the X axis versus the Robbery Rate per 100,000 on the Y axis. This uses my personal default chart template, but the problem is with the large over-plotted points in the scatter, which is the same for the default template that comes with installation.

The problem with this plot is that the vast majority of the points are clustered in the lower left corner. For the most part, the graph is expanded simply due to a few outliers in both dimensions (likely due in part to the heteroskedasticity that comes with rates in low population areas). While the outliers will certainly be of interest, we kind of lose the forest for the trees in this particular plot.

Two simple suggestions over the base default scatterplot are to utilize smaller points and/or make the points semi-transparent. On the left is an example of making the points smaller, and on the right is an example utilizing semi-transparency and small points. This de-emphasizes the outlier points (which could be good or bad depending on how you look at it), but allows one to see the main point cloud and the correlation between the two rates within it. (Note: you can open up the images in a new window to see them larger.)

Note if you are using SPSS, to define semi-transparency you need to specify it in the original GPL code (or in a chart template if you wanted); you cannot do it post hoc in the editor. You can make the points smaller in the editor, but editing charts with this many elements tends to be quite annoying, so to the extent you can specify the aesthetics in GPL I would suggest doing so. Also note making the elements smaller and semi-transparent can be effectively utilized to visualize line plots as well, and I gave an example at the SPSS IBM forum recently.
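
For example, here is a sketch of the GPL for the small, semi-transparent point version. BurglaryRate and RobberyRate are stand-in variable names (not the names in the posted syntax), and the size and transparency constants are just starting points to tweak.

* Hedged sketch - variable names are stand-ins for those in the posted syntax.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=BurglaryRate RobberyRate MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: BurglaryRate=col(source(s), name("BurglaryRate"))
 DATA: RobberyRate=col(source(s), name("RobberyRate"))
 GUIDE: axis(dim(1), label("Burglary Rate per 100,000"))
 GUIDE: axis(dim(2), label("Robbery Rate per 100,000"))
 ELEMENT: point(position(BurglaryRate*RobberyRate), size(size."2px"), transparency.exterior(transparency."0.7"))
END GPL.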

Another option is to bin the elements, and SPSS has options to utilize either rectangular bins or hexagon bins. Below is an example of each.

One thing that is nice about this technique, and how SPSS handles the plot, is that a bin is only drawn if at least one point falls within it. Thus the outliers and the one high leverage point in the plot are still readily apparent. Other ways to summarize distributions (that are currently not available in SPSS) are sunflower plots or contour plots. Sunflower plots are essentially another way to display and summarize multiple overlapping points (see Carr et al., 1987 or an example from this blog post by Analyzer Assistant). Contour plots are drawn by smoothing the distribution and then plotting lines of equal density. Here is an example of a contour plot using ggplot2 in R on the Cross Validated Q/A site.
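
The binned versions only swap the point element for a binned polygon element. Here is a hedged sketch of what the hexagon binning variant looks like in GPL, again with stand-in variable names; the rectangular version substitutes bin.rect for bin.hex.

* Hedged sketch of hexagon binning - bin counts mapped to the fill color.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=BurglaryRate RobberyRate MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: s=userSource(id("graphdataset"))
 DATA: BurglaryRate=col(source(s), name("BurglaryRate"))
 DATA: RobberyRate=col(source(s), name("RobberyRate"))
 GUIDE: axis(dim(1), label("Burglary Rate per 100,000"))
 GUIDE: axis(dim(2), label("Robbery Rate per 100,000"))
 ELEMENT: polygon(position(bin.hex(BurglaryRate*RobberyRate)), color.interior(summary.count()))
END GPL.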

This advice can also be extended to scatterplot matrices. In fact, such advice is more important in those plots, as each relationship is shrunk into a much smaller space. I talk about this some in my post on the Cross Validated blog, AndyW says Small Multiples are the Most Underused Data Visualization, where I say reducing information into key patterns can be useful.

Below on the left is an example of the default SPSS scatter plot matrix produced through the Chart Builder, and on the right after editing the GPL code to make the points smaller and semi-transparent.

I very briefly experimented with adding a loess smooth line or using the binning techniques in SPSS but was not successful. I will have to experiment more to see if it can be effectively done in scatterplot matrices. I would like to extend some of the example corrgrams I previously made to plot the loess smoother and bivariate confidence ellipses, and you can be sure I will post the examples here on the blog if I ever get around to it.

The data and syntax used to produce the plots can be found here.

Not participating in SPSS google group anymore, recommend using other SPSS forums

I have participated in asking and answering questions at the SPSS google group for close to two years now. I am going to stop though, and would recommend going to other places to ask and answer questions. The SPSS google group has become overrun with spam – much more spam than actual questions – and for at least a few months I have spent much more time marking spam than answering questions.

I have been diligently marking the spam that arises for quite some time, but unfortunately google has not taken any obvious action to prevent it from occurring. I note this mainly because the majority of recent spam has repeatedly come from only two email addresses. It is sad, mainly because I have a gmail account and I am fairly sure such obvious spam emails would not make it through to my gmail inbox, so I don’t know why they make it through to google groups.

I believe amounts of spam and actual questions have waxed and waned in the past, but I see little reason to continue to use the forum when other (better) alternatives exist. Fortunately I can recommend alternatives to ask questions related to conducting analysis in SPSS, and the majority of the same SPSS experts answer questions at many of the forums.

First I would recommend the Nabble SPSS group. I recommend it first mainly because it has the best pool of SPSS experts answering questions. It has an alright interface that includes formatting responses in HTML and uploading attachments. One annoyance I have is that many automatic email replies get through when you post something on this list (so people listening, set your automatic email replies to not reply to listserv addresses). To combat this I have a mass email filter, and if I get an out of office reply in response to a posting, your email address automatically goes to my email trash. In the future I may just send all responses from the listserv to the trash, as I follow questions and answers in the group using RSS feeds anyway.

Second I would recommend CrossValidated if the question has statistical content, and StackOverflow if it is strictly related to programming. SPSS doesn’t have many questions on StackOverflow, but it is one of the more popular tags on CrossValidated. These have the downside that the expert community that answers questions related to SPSS specifically is smaller than Nabble’s (although respectable), but they have the advantage of potentially greater community input on other tangential aspects, especially advice about statistical analysis. Another advantage is the use of tags, and the ability to write posts in markdown as well as embed images. A downside is that you cannot directly attach files.

Third I would recommend the SPSS developerworks forum. There has been little to no community build-up there, and so it is basically ask Jon Peck a question. For this reason I would recommend the other forums over the IBM provided forum. While the different sub-sections for different topics are a nice idea, the StackOverflow style of tags IMO works so much better that it is hard to say nice things about the other listserv or forum types of infrastructure. Jon Peck answers questions at all of the above places as well, so it is not like you are missing his input by choosing StackOverflow as opposed to the IBM forum.

I hope in the future the communities will all just pick one place (and I would prefer all the experts migrate to CrossValidated and StackOverflow for many reasons). But at least no one should miss the google group with the other options available.