Why great circle lines look nicer in flow maps

I got sick of working on my dissertation the other day so I started writing a review article on visualizing flow lines for journey to crime data. Here I will briefly illustrate why great circle lines tend to look nicer in flow maps than do straight lines.

Flow maps tend to be very visually complicated, and so to a large extent what happens is what is shown in Panel B of the graphic above. Bending the lines, as great circles do, tends to displace the lines from one another to a greater extent. Although the perfect overlap demonstrated in the picture doesn’t necessarily happen that frequently, the same logic applies to nearly overlapping lines. One of the nicest examples you can find is the Facebook friends map that made the rounds on the internet (note there are many other aesthetic elements in that plot, besides just the great circle lines, that make it look nice).

Of course, with great circle lines you don’t get the bending in opposite directions for reciprocal flows that I demonstrate in my first figure (the great circle line is the same regardless of direction). Because of this, and because great circle lines in a local projection don’t really provide enough eccentricity in the bend to produce the desired displacement of the lines, I suggested utilizing half circles, and I discuss how to calculate them given a set of origin-destination coordinates at this question on the GIS site.

I need to test this out in the wild some more though. I suspect a half circle is too much bend, but my attempts to script a version where the eccentricity is less pronounced have befuddled me so far. I will post an update here if I come to a better solution, and when the working paper is finished I will post a copy of that as well. Preferably I would like the script to take an arbitrary parameter to control the amount of bend in the arc, so if you have suggestions feel free to shoot me an email or leave a comment here.
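
For concreteness, here is one way such a bend parameter could work (just a sketch of the geometry, not the method from the GIS answer). With origin $O$, destination $D$, midpoint $M = (O + D)/2$, half-distance $r = \|D - O\|/2$, unit vector $\hat{u} = (D - O)/\|D - O\|$, and $\hat{v}$ a unit vector perpendicular to $\hat{u}$, the arc is

$$P(t) = M - r\cos(t)\,\hat{u} + b\,r\sin(t)\,\hat{v}, \qquad t \in [0, \pi].$$

The endpoints are $P(0) = O$ and $P(\pi) = D$ regardless of $b$, while $b = 1$ recovers the half circle and smaller values of $b$ flatten the arc toward a straight line (for $b < 1$ the arc is half an ellipse rather than a circle).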

For those interested in the topic, I would suggest perusing one of my other answers at the GIS site, where I give a host of references and online mapping examples for visualizing flows.

Visualization techniques for large N scatterplots in SPSS

When you have a large N scatterplot, you frequently have dramatic over-plotting that prevents effectively presenting the relationship. Here I will give a few quick examples of simple ways to alter the typical default scatterplot to ease the presentation. I give examples in SPSS, although I suspect most statistical packages contain similar options to alter the default scatterplot. At the end of the post I link to the SPSS code and data I used for these examples. For a brief background on the data: these are UCR index crime rates for rural counties in Appalachia, by year, from 1977 to 1996, taken from the dataset Spatial Analysis of Crime in Appalachia, 1977-1996 posted on ICPSR (doi:10.3886/ICPSR03260.v1). While these scatterplots ignore the time dimension of the dataset, they are sufficient to demonstrate techniques for visualizing large N scatterplots, as they result in over 7,000 county-years to visualize.

So what is the problem with typical scatterplots for such large data? Below is an example default scatterplot in SPSS, plotting the Burglary Rate per 100,000 on the X axis versus the Robbery Rate per 100,000 on the Y axis. This uses my personal default chart template, but the problem is the large over-plotted points in the scatter, which is the same with the default template that comes with the installation.

The problem with this plot is that the vast majority of the points are clustered in the lower left corner. For the most part, the graph is expanded simply due to a few outliers in both dimensions (likely due in part to the heteroskedasticity that comes with rates in low population areas). While the outliers will certainly be of interest, we kind of lose the forest for the trees in this particular plot.

Two simple amendments to the base default scatterplot are to utilize smaller points and/or make the points semi-transparent. On the left is an example of making the points smaller, and on the right is an example utilizing semi-transparency along with small points. This de-emphasizes the outlier points (which could be good or bad depending on how you look at it), but allows one to see the main point cloud and the correlation between the two rates within it. (Note: you can open up the images in a new window to see them larger.)

Note if you are using SPSS, semi-transparency needs to be defined in the original GPL code (or in a chart template if you wanted); you cannot do it post hoc in the editor. You can make the points smaller in the editor, but editing charts with this many elements tends to be quite annoying, so to the extent you can specify the aesthetics in GPL I would suggest doing so. Also note that making the elements smaller and semi-transparent can be effectively utilized for line plots as well, and I gave an example at the SPSS IBM forum recently.
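
For reference, a minimal sketch of what that inline GPL might look like, with both smaller points and semi-transparency specified up front (the variable names here are hypothetical, not the actual names in the ICPSR file):

*Sketch: small, semi-transparent points specified in the original GPL.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=burg_rate rob_rate MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: burg_rate=col(source(s), name("burg_rate"))
  DATA: rob_rate=col(source(s), name("rob_rate"))
  GUIDE: axis(dim(1), label("Burglary Rate per 100,000"))
  GUIDE: axis(dim(2), label("Robbery Rate per 100,000"))
  ELEMENT: point(position(burg_rate*rob_rate), size(size."3px"),
    transparency.exterior(transparency."0.7"), transparency.interior(transparency."0.7"))
END GPL.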

Another option is to bin the elements, and SPSS has options to utilize either rectangular bins or hexagon bins. Below is an example of each.
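
In GPL the binned versions just swap the point element for a binned polygon. Below is a sketch of the hexagon version (same hypothetical variable names as above; replacing bin.hex with bin.rect should give the rectangular version):

*Sketch: hexagon binned scatterplot, with the count in each bin mapped to fill color.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=burg_rate rob_rate MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: burg_rate=col(source(s), name("burg_rate"))
  DATA: rob_rate=col(source(s), name("rob_rate"))
  GUIDE: axis(dim(1), label("Burglary Rate per 100,000"))
  GUIDE: axis(dim(2), label("Robbery Rate per 100,000"))
  ELEMENT: polygon(position(bin.hex(burg_rate*rob_rate)), color.interior(summary.count()))
END GPL.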

One thing that is nice about this technique, and how SPSS handles the plot, is that a bin is only drawn if at least one point falls within it. Thus the outliers and the one high-leverage point in the plot are still readily apparent. Other ways to summarize distributions (that are currently not available in SPSS) are sunflower plots and contour plots. Sunflower plots are essentially another way to display and summarize multiple overlapping points (see Carr et al., 1987, or an example from this blog post by Analyzer Assistant). Contour plots are drawn by smoothing the distribution and then plotting lines of equal density; here is an example of a contour plot using ggplot2 in R on the Cross Validated Q/A site.

This advice can also be extended to scatterplot matrices. In fact such advice is more important in those plots, as each relationship is shrunk into a much smaller space. I talk about this some in my post on the Cross Validated blog, AndyW says Small Multiples are the Most Underused Data Visualization, where I say reducing information into key patterns can be useful.

Below on the left is an example of the default SPSS scatterplot matrix produced through the Chart Builder, and on the right the same matrix after editing the GPL code to make the points smaller and semi-transparent.

I very briefly experimented with adding a loess smooth line or using the binning techniques in SPSS but was not successful. I will have to experiment more to see if they can be effectively done in scatterplot matrices. I would like to extend some of the example corrgrams I previously made to plot the loess smoother and bivariate confidence ellipses, and you can be sure I will post the examples here on the blog if I ever get around to it.

The data and syntax used to produce the plots can be found here.

Not participating in SPSS google group anymore, recommend using other SPSS forums

I have participated in asking and answering questions at the SPSS google group for close to two years now. I am going to stop though, as I would recommend going other places to ask and answer questions. The SPSS google group has become overrun with spam, much more spam than actual questions, and for at least a few months I have spent far more time marking spam than answering questions.

I have diligently been marking the spam for quite some time, but unfortunately google has not taken any obvious action to prevent it from occurring. I note this mainly because the majority of recent spam has repeatedly come from only two email addresses. It is sad because I have a gmail account and am fairly sure such obvious spam emails would not make it through to my inbox, so I don’t know why they make it through to google groups.

I believe the amounts of spam and actual questions have waxed and waned in the past, but I see little reason to continue to use the forum when other (better) alternatives exist. Fortunately I can recommend alternatives for asking questions related to conducting analysis in SPSS, and the majority of the same SPSS experts answer questions at many of these forums.

First I would recommend the Nabble SPSS group, mainly because it has the best pool of SPSS experts answering questions. It has an alright interface that includes formatting responses in HTML and uploading attachments. One annoyance is that many automatic email replies get through when you post something to this list (so people listening, set your automatic email replies to not respond to list-serve addresses). To combat this I have a mass email filter, and if I get an out-of-office reply in response to a posting, your email address automatically goes to my email trash. In the future I may just send all responses from the list-serve to the trash, as I follow questions and answers in the group using RSS feeds anyway.

Second I would recommend CrossValidated if the question has statistical content, and StackOverflow if it is strictly related to programming. SPSS doesn’t have many questions on StackOverflow, but it is one of the more popular tags on CrossValidated. These have the downside that the expert community answering SPSS-specific questions is smaller than Nabble’s (although respectable), but they have the advantage of potentially greater community input on tangential aspects, especially advice about statistical analysis. Another advantage is the use of tags, and the ability to write posts in markdown as well as embed images. A downside is that you cannot directly attach files.

Third I would recommend the SPSS developerworks forum. There has been little to no community built up there, so it is basically ask Jon Peck a question. For this reason I would recommend the other forums over the IBM provided one. While the different sub-sections for different topics are a nice idea, the StackOverflow style of tags IMO works so much better that it is hard to say nice things about the list-serve or forum types of infrastructure. Jon Peck answers questions at all of the above places as well, so it is not like you are missing his input by choosing StackOverflow as opposed to the IBM forum.

I hope in the future the communities will all just pick one place (I would prefer all the experts migrate to CrossValidated and StackOverflow, for many reasons). But at least no one should miss the google group with the other options available.

Dressing for success in academic interviews

On the Academia stack exchange site a recent question came up about how to dress for academic interviews. The verdict was that over-dressing is better than under-dressing. I had come across similar (although somewhat discordant) advice previously, but it is always good to have second opinions.

I will mention the previous blog posts I had come across, as they have good advice for giving academic job talks in general;

I think my advice is just to give a talk so awesome no one cares what you wear! But take that advice with care; it is coming from a person who not only doesn’t know what he wore yesterday, but has on occasion worn shirts backwards (in public) all day long. I would probably only notice if you showed up dressed like Dilbert in this strip;

If you have other questions related to academia, head on over and check out the Academia stack exchange site. Here are a few examples of some of my favorite discussions so far;

Bean plots in SPSS

It seems like I have come across a lot of posts recently about visualizing univariate distributions. Besides my own recent blog post about comparing distributions of unequal size in SPSS, here are a few other blog posts I have recently come across;

Such a variety of references is not surprising though. Examining univariate distributions is a regular task in data analysis and can tell you a lot about the nature of the data (including potential errors in it). Here are some related posts on the Cross Validated Q/A site I have compiled;

In particular, the recent post on bean plots and Luca Fenu’s post motivated my playing around with SPSS to produce the bean plots here. Note Jon Peck has published a graphboard template to generate violin plots in SPSS, but here I will show how to generate them with the usual GGRAPH commands. It is actually pretty easy, and here I extend the violin plots to include the beans suggested in bean plots!

A brief bit about the motivation for bean plots (besides consulting the article by Peter Kampstra): one is interested in viewing a univariate continuous distribution among a set of different categories. To do this, one uses a smoothed kernel density estimate of the distribution for each of the subgroups. When viewing the smoothed distribution, though, one loses the ability to identify patterns in the individual data points. Patterns can mean many things, such as outliers, or striation within the main body of observations. The bean plot article gives an example where striation in measurements at specific inches can be seen. Another example might be examining the times of reported crime incidents (they will bunch at the beginning of the hour, as well as at the 15, 30, & 45 minute marks).

Below I will go through a brief series of examples demonstrating how to make bean plots in SPSS.


SPSS code to make bean plots

First I will make some fake data for us to work with.

******************************************.
set seed = 10.
input program.
loop #i = 1 to 1000.
compute V1 = RV.NORM(0,1). /*standard normal variable.
compute groups = TRUNC(RV.UNIFORM(0,5)). /*categorical variable with 5 groups (0 to 4).
end case.
end loop.
end file.
end input program.
dataset name sim.
execute.

value labels groups
0 'cat 0'
1 'cat 1'
2 'cat 2'
3 'cat 3'
4 'cat 4'.
******************************************.

Next, I will show some code to make the two plots below. These are typical kernel density estimates of the V1 variable for the entire distribution, and they show the elements of the base bean plot. Note the use of the TRANS statement in the GPL to make a constant value for plotting the rug of the distribution. Also note that although such rugs are typically shown as bars, you can pretty much always use point markers in any situation where you use bars. Below the image is the GGRAPH code used to produce them.

******************************************.
*Regular density estimate with rug plot.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V1 MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: V1=col(source(s), name("V1"))
  TRANS: rug = eval(-26)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), label("Density"))
  SCALE: linear(dim(2), min(-30))
  ELEMENT: interval(position(V1*rug), transparency.exterior(transparency."0.8"))
  ELEMENT: line(position(density.kernel.epanechnikov(V1*1)))
END GPL.

*Density estimate with points instead of bars for rug.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V1 MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: V1=col(source(s), name("V1"))
  TRANS: rug = eval(-15)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), label("Density"))
  SCALE: linear(dim(2), min(-30))
  ELEMENT: point(position(V1*rug), transparency.exterior(transparency."0.8"))
  ELEMENT: line(position(density.kernel.epanechnikov(V1*1)))
END GPL.
******************************************.

Now bean plots are just the above plots rotated 90 degrees, with a reflection of the distribution added (so the area of the density is represented in two dimensions), and then further paneled by a categorical variable. To do the reflection, one has to create a fake variable equal to the first variable used for the density estimate. But after that, it is just knowing a little GGRAPH magic to make the plots.

******************************************.
compute V2 = V1.

varstocases
/make V from V1 V2
/index panel_dum.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V panel_dum groups MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  COORD: transpose(mirror(rect(dim(1,2))))
  DATA: V=col(source(s), name("V"))
  DATA: panel_dum=col(source(s), name("panel_dum"), unit.category())
  DATA: groups=col(source(s), name("groups"), unit.category())
  TRANS: zero = eval(10)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), null())
  GUIDE: axis(dim(3), null())
  SCALE: linear(dim(2), min(0))
  ELEMENT: area(position(density.kernel.epanechnikov(V*1*panel_dum*1*groups)), transparency.exterior(transparency."1.0"), transparency.interior(transparency."0.4"),
           color.interior(color.grey), color.exterior(color.grey))
  ELEMENT: interval(position(V*zero*panel_dum*1*groups), transparency.exterior(transparency."0.8"))
END GPL.
******************************************.

Note I did not label the density estimate anymore. I could have, but I would have had to essentially divide the density estimate by two, since I am showing it twice (which is possible, and if you wanted to show it you would omit the GUIDE: axis(dim(2), null()) statement). But even without the axis the estimates are still reasonable for relative comparisons. Also note the COORD statement is how I get the panels to mirror each other (the transpose function just switches the X and Y axes in the chart).

I just post hoc edited the chart to get it to look nice (in particular, setting the spacing between the panel_dum panels to zero and making the panel outlines transparent), but most of those things can likely be streamlined by making an appropriate chart template. There are two things I do not like, which I may need to edit the chart template to accomplish anyway: 1) there is an artifact of a white line running down the density estimates (it is hard to see with the rug, but closer inspection will show it), and 2) I would prefer to have a box around all of the estimates and categories, but to prevent a streak running down the middle of the density estimates one needs to draw the panel boxes without borders. To see if I can accomplish these things will take further investigation.

This framework is easily extended to the case where you don’t want a reflection of the same variable, but instead want to plot the continuous distribution estimate of a second variable. Below is an example, and here I have posted the syntax used in making this post in its entirety. In there I also have an example of weighting groups inversely proportional to the total items in each group, which should make the area of each group equal (see the sketch below).
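
The weighting itself is just an AGGREGATE plus a compute; a minimal sketch (the weight variable name is mine):

******************************************.
*Sketch: weight cases inversely proportional to group size, so each bean has equal area.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=groups
  /n_group=N.
compute beanweight = 1/n_group.
******************************************.

The weight variable then gets added to the GRAPHDATASET variable list and referenced with weight() on the userSource (the same pattern I use in my post on comparing distributions of unequal size groups).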

In this example of comparing groups, I utilize dots instead of the bar rug, as I believe it provides more contrast between the two distributions. Also note that in general I have not superimposed other summary statistics (some published bean plots have quartile lines superimposed). You could do this, but it gets a bit busy.

Comparing continuous distributions of unequal size groups in SPSS

The other day I had the task of comparing the distributions of a continuous variable between two groups. One complication that arose when making graphical comparisons was that the groups had unequal sample sizes. I’m making this blog post mainly because many of the options I will show can’t be done in SPSS directly through the graphical user interface (GUI), but understanding a little bit about how the graphic options work in GPL will help you make the charts you want without having to rely solely on what is available through the GUI.

The basic means I typically start out with are histograms, box plots, and a few summary statistics. The beginning code is just how I generated some fake data to demonstrate these graphics.

SET TNumbers=Labels ONumbers=Labels OVars=Labels TVars=Labels.
dataset close ALL.
output close ALL.
*making fake cases data.
set seed = 10.
input program.
loop #i = 1 to 5000.
if #i <= 1500 group = 1.
if #i > 1500 group = 2.
end case.
end loop.
end file.
end input program.
dataset name sim.
execute.

*making approximate log normal data.
if group = 1 time_event = (RV.LNORMAL(0.5,0.6))*10.
if group = 2 time_event = (RV.LNORMAL(0.6,0.5))*10.

variable labels time_event 'Time to Event'.
value labels group 
1 'Group 1'
2 'Group 2'.
formats group time_event (F3.0).

variable level group (nominal).

*Good First Stabs are Histograms and Box plots and summary statistics.
GRAPH
  /HISTOGRAM=time_event
  /PANEL ROWVAR=group ROWOP=CROSS.

EXAMINE VARIABLES=time_event BY group
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL. 

So this essentially produces a summary statistics table, a paneled histogram, and a box-plot (shown below).

At first blush this is an alright way to visually assess various characteristics of each distribution, and the unequal sizes of the groups are not problematic when comparing the summary statistics or the box plots. The histogram produced by SPSS, though, shows the frequency of events per bin, and this makes it difficult to compare Group 2 to Group 1, as Group 2 has so many more observations. One way to normalize the distributions is to make a histogram showing the percent of the distribution that falls within each bin as opposed to the frequency. You can actually do this through the Chart Builder GUI, but it is buried among various other options; below is a screen shot showing how to change the histogram from frequencies to percents. Also note you need to change what the base percentage is built from, by clicking the Set Parameters button (circled in red) and then toggling the denominator choice in the new pop-up window to the total for each panel (if you click on the screen shot images they will open up larger).

Sometimes you can’t get what you want through the Chart Builder GUI though. For example, I originally wanted to make a population pyramid type chart, and the GUI does not allow you to specify the base percent like that. So I originally made a pyramid chart like this;

And here is what the pasted output appears like.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event group MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: group=col(source(s), name("group"), unit.category())
  COORD: transpose(mirror(rect(dim(1,2))))
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(1), opposite(), label("Time to Event"))
  GUIDE: axis(dim(2), label("Frequency"))
  GUIDE: axis(dim(3), label("group"), opposite(), gap(0px))
  GUIDE: legend(aesthetic(aesthetic.color), null())
  SCALE: cat(dim(3), include("1", "2"))
  ELEMENT: interval(position(summary.count(bin.rect(time_event*1*group))), color.interior(group))
END GPL.

To get percent bins instead of count bins takes one very simple change to the summary specification on the ELEMENT statement: insert summary.percent.count instead of summary.count. This will approximately produce the chart below.
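
That is, the ELEMENT statement from the pyramid chart above becomes:

ELEMENT: interval(position(summary.percent.count(bin.rect(time_event*1*group))), color.interior(group))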

You can actually post hoc edit the traditional histogram to make a population pyramid (by mirroring the panels), but examining the GPL produced for the above chart gives you a glimpse of the variety of charts you can potentially produce in SPSS.

Another frequent way to assess continuous distributions like those displayed so far is by estimating a kernel density smoother over the distribution (sometimes referred to by the acronym KDE, where the E is for estimate). Sometimes this is preferable because our perception of the distribution can be too strongly impacted by the histogram bins. Kernel density smoothers aren’t available through the GUI at all though (as far as I’m aware), so you would only know the potential existed if you looked at the examples in the GPL reference guide that comes with the software. Below is an example (including code).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event group MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Kernel Density Estimate"))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("1", "2"))
  ELEMENT: line(position(density.kernel.epanechnikov(time_event*group)), color(group))
END GPL.

Although the smoothing is useful, again we have a problem with the unequal number of cases in the distributions. To solve this, I weighted cases inversely proportional to the number of observations in each group (i.e. the weight for group 1 is 1/1500, and the weight for group 2 is 1/3500 in this example). This makes the area underneath each line sum to 1, and so to get the estimate back on the original frequency scale you simply multiply the marginal density estimate by the total in the corresponding group. So for instance, if the marginal density for group 2 at a time to event value of 10 is 0.05, the estimated frequency given 3,500 cases is 0.05 * 3500 = 175. To get back on a percentage scale you would just multiply by 100.

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=group
  /cases=N.
compute myweight = 1/cases.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event group myweight MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"), weight(weightedVar))
  DATA: weightedVar=col(source(s), name("myweight"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Weighted Kernel Density Estimate"))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  GUIDE: text.footnote(label("Density is weighted inverse to the proportion of cases within each group. The number of cases in group 1 equals 1,500, and the number of cases in group 2 equals 3,500."))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("1", "2"))
  SCALE: linear(dim(2))
  ELEMENT: line(position(density.kernel.epanechnikov(time_event*group)), color(group))
END GPL.

One critique of this, though, is that choosing a kernel and bandwidth is ad hoc (I just used the default kernel and bandwidth in SPSS here, and it differed in unexpected ways between the frequency counts and the weighted estimates, which is undesirable). Also you can see that some of the density is smoothed over illogical values in this example (values below 0). Other potential plots are the cumulative distribution and QQ-plots comparing the quantiles of each distribution to each other. Again these are difficult to impossible to obtain through the GUI. Here is the closest I could come to getting a cumulative distribution by groups through the GUI.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event COUNT()[name="COUNT"] group 
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: COUNT=col(source(s), name("COUNT"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Cumulative Percent of Total"))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("1", "2"))
  ELEMENT: line(position(summary.percent.cumulative(time_event*COUNT, base.all(acrossPanels()))), 
    color.interior(group), missing.wings())
END GPL.

This is kind of helpful, but not really what I want. I wasn’t quite sure how to change the summary statistic functions in the ELEMENT statement to calculate percents within groups (I assume it is possible, I just don’t know how), so I ended up making the actual data to include in the plot. Example syntax and plot below.

sort cases by group time_event.
compute id = $casenum.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=group
  /id_min=MIN(id)
  /id_max=MAX(id).
compute cum_prop = ((id + 1) - id_min)/(id_max - (id_min - 1)). /*within-group rank divided by group size.


*Here is the cumulative proportion I want.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event cum_prop group MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event=col(source(s), name("time_event"))
  DATA: cum_prop=col(source(s), name("cum_prop"))
  DATA: group=col(source(s), name("group"), unit.category())
  GUIDE: axis(dim(1), label("Time to Event"))
  GUIDE: axis(dim(2), label("Cumulative Percent within Groups"))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("1", "2"))
  ELEMENT: line(position(time_event*cum_prop), color.interior(group), missing.wings())
END GPL.

These cumulative plots aren’t as sensitive to binning as the histograms or KDE estimates, and in fact many interesting questions are much easier to address with the cumulative plots. For instance, if I wanted to know the proportion of events that happen within 10 days (or its complement, the proportion of events that have not yet occurred by 10 days), this is an easy task with the cumulative plots. It would be at best extremely difficult to determine from the histogram or density estimates. The cumulative plot also gives a graphical comparison of the distributions (although perhaps not as intuitive as the histogram or KDE estimates). For instance, it is easy to see that the location of group 2 is slightly shifted to the right.

The last plot I present is a QQ-plot. These are typically presented as plotting an empirical distribution against a theoretical distribution, but you can plot two empirical distributions against each other. Again you can’t quite get the QQ-plot of interest through the regular GUI, and you have to do some data manipulation to construct the elements of the graph. You can do QQ-plots against a theoretical distribution with the PPLOT command, so you could make separate QQ-plots for each subgroup, but this is less than ideal. Below I paste an example of my constructed QQ-plot, along with syntax showing how to use the PPLOT command for separate sub-groups (using SPLIT FILE) and how to get the quantiles of interest using the RANK command.

sort cases by group time_event.
split file by group.
PPLOT
  /VARIABLES=time_event
  /NOLOG
  /NOSTANDARDIZE
  /TYPE=Q-Q
  /FRACTION=BLOM
  /TIES=MEAN
  /DIST=LNORMAL.
split file off.

*Not really what I want - I want Q-Q plot of one group versus the other group.
RANK VARIABLES=time_event (A) BY group
  /NTILES(99)
  /PRINT=NO
  /TIES=MEAN.

*Now aggregating to new dataset.
DATASET DECLARE quantiles.
AGGREGATE
  /OUTFILE='quantiles'
  /BREAK=group Ntime_ev 
  /time_event=MAX(time_event).
dataset activate quantiles.

sort cases by Ntime_ev group.
casestovars
/id = Ntime_ev
/index = group.

DATASET ACTIVATE quantiles.
* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=time_event.1[name="time_event_1"] 
    time_event.2[name="time_event_2"] MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: time_event_1=col(source(s), name("time_event_1"))
  DATA: time_event_2=col(source(s), name("time_event_2"))
  GUIDE: axis(dim(1), label("Quantiles Time to Event Group 1"))
  GUIDE: axis(dim(2), label("Quantiles Time to Event Group 2"))
  ELEMENT: point(position(time_event_1*time_event_2))
  ELEMENT: line(position(time_event_1*time_event_1))
END GPL.

Although I started out with a simple question, it takes a fair bit of knowledge about both graphically comparing distributions and data management (i.e. how to shape your data) to be able to make all of these types of charts in SPSS. I intentionally made the reference distributions very similar, and if you just stuck with the typical histogram the slight differences in location and scale between the two distributions would not be as evident as they are with the kernel density, the cumulative distribution, or the QQ-plots.

Beware of Mach Bands in Continuous Color Ramps

A recent post of mine on the Cross Validated statistics site addressed how to make kernel density maps more visually appealing. The answer there was basically to adjust the bandwidth until you get a reasonably smoothed surface (where reasonable means not over-smoothed into one big hill nor under-smoothed into a bunch of unconnected hills).

Another problem that frequently comes along with utilizing the default types of raster gradients is that of Mach bands. Here is a replicated image I used in the Cross Validated post (made utilizing the spatstat R library).

Even though the color ramp is continuous, you see some artifacts around the gradient where the hue changes from what our eyes see as green to blue. To be more precise, approximately where the green hue touches the blue hue, the blue color appears to be lighter than the rest of the blue background. This is not actually the case; it is just an optical illusion (you can even see the Mach bands in the legend if you look closely). Mark Monmonier in How to Lie with Maps gives an example of this, and uses it as a reason not to use continuous color ramps (another reason he gives is that it is very difficult to map a color to an exact numerical location on the ramp). Note this isn’t something that happens only with this particular color ramp; it happens even when the hue is constant (the wikipedia page gives an example with varying grey saturation).

So what, you say? Well, part of the reason it is a problem is that the artifact reinforces unnatural boundaries or groupings in the data, the exact opposite of what one wants with a continuous color ramp! Also the groupings are largely at the will of the computer, and I would think the analyst wants to define the groupings themselves when disseminating the maps (although this brings up another problem, how to define the color breaks). A general principle of how people interpret such maps is that they tend to form homogeneous groupings anyway, so for both exploratory purposes and disseminating maps we should keep this in mind.

This isn’t a problem limited to isopleth maps either; the Color Brewer online app explicitly demonstrates this phenomenon for choropleth maps visualizing irregular polygons. What happens is that a county that is spatially outlying compared to its neighbors appears more extreme on the color gradient than when it is surrounded by colors of the same hue and saturation. Below is a screen shot of what I am talking about, with some of the examples circled in red. It is easy to see that they are spatially outlying, but harder to map them to the actual color on the ramp (and it gets harder when you have more bins).

Even with these problems I think the default plots in spatstat are perfectly fine for exploratory analysis. To disseminate the plots, though, I would prefer discrete bins in many (perhaps most) situations. I’ll defer discussion of how to choose the bins to another time!

Co-maps and Hot spot plots! Temporal stats and small multiple maps to visualize space-time interaction.

One of the problems with visualizing and interpreting spatial data is that there are characteristics of geographical data that are hard to display on a static, two-dimensional map. Friendly (2007) makes the pertinent distinction between map and non-map based graphics, and so the challenge is to effectively interweave them. One way to try to overcome this is to create graphics intended to supplement the map-based data. Below I give two examples pertinent to analyzing point-level crime patterns with attached temporal data: co-maps (Brunsdon et al., 2009) and the hot spot plot (Townsley, 2008).

co-maps

The concept of co-maps is an extension of co-plots, a small multiple scatterplot visualization technique originally introduced by William Cleveland (1994). Co-plots are in essence a series of small multiple scatterplots in which each scatterplot is conditioned on a third (or potentially fourth) variable. What is unique about co-plots, though, is that the categories of the conditioning variable are not mutually exclusive; the conditions overlap.

The point of co-plots, in general, is to see whether the relationship between two variables has an interaction with a third, continuous variable. When the conditioning variable is continuous, we wouldn’t expect the interaction to change dramatically at discrete cut-offs of that variable, so we want to examine the interaction effect at varying levels of the conditioning variable. This is also useful when the data are sparse and you don’t want to introduce artifactual relationships by making arbitrary cut-offs for the conditioning variable.
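
To make the overlapping conditioning concrete, here is a minimal sketch of how you might construct overlapping slices in SPSS by duplicating cases into each band they belong to (the variables x, y, z and the band cut-points are all hypothetical):

*Sketch: each case is copied into every overlapping band its z value falls within.
compute band1 = RANGE(z, 0, 40).
compute band2 = RANGE(z, 30, 70).
compute band3 = RANGE(z, 60, 100).
varstocases
/make inband from band1 band2 band3
/index = band.
select if inband = 1.
*A scatterplot of x*y paneled by band then approximates a co-plot.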

Besides the Cleveland paper cited (which is publicly available; link in the citations at the bottom of the post), there are some good examples of coplot scatterplots in the R graphical manual.

Brunsdon et al. (2009) extend the concept to analyzing point patterns, with time as the conditioning variable. Also, because the geographic data are numerous, they apply kernel density estimation (KDE) to visualize the results (instead of a sea of overlapping points). When visualizing geographic data, too many points are a common problem, and the solutions are essentially the same as those used for scatterplots (this thread at the stats site gives a few resources and examples concerning that). Below I’ve copied a picture from Brunsdon et al. (2009) to show the technique applied to crime data.

Although the example is conditional on temperature (instead of time), it should be easy to see how it could be extended to make the same plot conditional on time. Also note the bar graph at the top denotes the temperature range, with the lowest bar corresponding to the graphic that is in the panel on the bottom left.

Also of potential interest, the same authors applied the same visualization technique to reported fires in another publication (Corcoran et al., 2007).

the hot spot plot

Another similarly motivated graphical presentation of the interaction of time and space is the hot spot plot proposed by Michael Townsley (2008). Below is an example.

The motivation here is having coincident graphics simultaneously depicting long-term temporal trends (in a sparkline-like graphic at the top of the plot), spatial hot spots depicted using KDE, and a lower bar graphic depicting hourly fluctuations. This allows one to identify spatial hot spots and then quickly assess their temporal nature. The example from the Townsley article I give is a secondary plot showing zoomed-in locations of several analyst-chosen hot spots, with the remaining cut-out events left as a baseline.

Some food for thought when examining space-time trends with point pattern crime data.


Citations

Connect with me!

In taking the advice of this question on the Academia stack exchange site, Is web-presence important for researchers?, I’ve created this blog and joined several other social networking sites. If you are interested in my work, or think I would be interested in yours, feel free to connect with me in any of the following venues.

Are there any other sites I should be joining? Let me know in the comments.

Making a reproducible example in SPSS

Since I participate on several sites where programming-related questions about SPSS appear (StackOverflow and the SPSS google group forum mainly), I figured it would be useful to share some code snippets that allow one to make a minimal working example demonstrating the problem (similar in flavor to this question on StackOverflow for R).

Basically this involves two steps: 1) making some fake data (or using an existing dataset), and 2) including the code that produces the error or a description of what you would like to accomplish. Since 2 will vary depending on your situation, here I will just demonstrate the first part, making some fake data.

There are four main ways I use syntax to generate fake data on a regular basis to work with. Below I will demonstrate them;

INPUT PROGRAM

*******************************.
set seed = 10. /* sets random seed generator to make exact data reproducible */.
input program.
loop #j = 1 to 100. /*I typically use scratch variables (i.e. #var) when making loops.
    loop #i = 1 to 100. /*multiple loops allow you to make grouped data.
    compute V1 = RV.NORM(0,1). /*you can use the random number generators to make different types of data.
    compute V2 = RV.UNIFORM(0,1).
    compute V3 = RV.POISSON(3).
    compute V4 = RV.BERNOULLI(.5).
    compute V5 = RV.BINOM(5,.8).
    compute mycat = TRUNC(RV.UNIFORM(0,5)). /*this makes categorical data with 5 groups (0 to 4).
    compute group = #j. /*this assigns the scratch variable #j to an actual variable.
    end case.
    end loop.
end loop.
end file.
end input program.
dataset name sim.
execute. /*note spacing is arbitrary and is intended to make code easier to read.
*******************************.

Using an input program block and the random variable functions provided by SPSS is my most frequent way to make data to work with. Above I also demonstrate the ability to make grouped data by using two loops within the input program block, as well as a variety of different data types using SPSS’s random number generators. This is also the best way to make big data; for an example, see this question I answered on StackOverflow, How to aggregate on IQR in SPSS?, which required an example with 4 million cases.

DATA LIST

*******************************.
data list free / V1 (F2.0) V2 (F2.0) V3 (A4).
begin data
1 2 aaaa
3 4 bbbb
5 6 cccc
end data.
dataset name input.
*******************************.

Using data list is just SPSS’s way to read in plain text data. An example where this came in handy was another question I answered on StackOverflow, How to subtract certain minutes from a DateTime in SPSS. There I read in some custom date-time data as strings and demonstrated how to convert them to actual date-time variables. I also used this recently in a question over at the Developerworks forum to show some plotting capabilities, Plotting lines and error bars. It was just easier in that instance to make some fake data that conformed to how I needed the data in GPL than to go through a bunch of transformations to shape the data.

GET FILE

*******************************.
*Base datasets that come with SPSS.
FILE HANDLE base_data /Name = "C:\Program Files\SPSSInc\Statistics17\Samples\English".
get file = "base_data\Cars.sav".
dataset name cars.
get file = "base_data\1991 U.S. General Social Survey.sav".
dataset name gss.
*there are a bunch more data files in there.
*******************************.

SPSS comes with a bunch of example datasets, and you can insert some simple code to grab one of them. Note here I use FILE HANDLE, making it easier for someone to update the code to the location of their own data (same for saving data files). The same logic could be used if you upload your exact data to, say, dropbox to allow people to download it.

Data with cases Python program

*******************************.
begin program.
import statsgeneratedata
numvar = 3
numcase = 100
parms = "0 1 "
dsname = "python_sim"
factor = "nofactor"
corrtype = "ARBITRARY"
corrs = "1 .5 1 .5 .5 1"
displaycorr="noprint"
distribution = "NORMAL"
displayinputpgm = "noprint"
statsgeneratedata.generate(dsname, numvar, numcase, distribution, parms, 
  factor, corrtype, corrs, displaycorr, displayinputpgm)
end program.
*******************************.

This uses the custom Python program makedata, available as a download from SPSS developerworks. Although I provide code above, once installed it comes with its own GUI. Although I don’t typically use this when answering questions (as it requires having Python installed), it has certainly come in handy in my own analysis, especially for the ability to generate correlated data.


This isn’t the only part of making an easy-to-answer question, but having some data at hand to demonstrate your problem is a good start. It certainly makes the work of others who are trying to help easier. Also see a (as I am writing this) recent exchange on posting to the NABBLE SPSS group. IMO making some example data to demonstrate your problem is a very good start to asking a clear and cogent question.