All posts by Andy Wheeler

Estimating group based trajectory models using SPSS and R

For a project I have been estimating group based trajectory models for counts of crime at micro places. Synonymous with the trajectory models David Weisburd and colleagues estimated for street segments in Seattle. Here I will show how using SPSS and the R package crimCV one can estimate similar group based trajectory models. Here is the crimCV package help and here is a working paper by the authors on the methodology. The package specifically estimates a zero inflated poisson model with the options to make the 0-1 part and/or the count part have quadratic or cubic terms – and of course allows you specify the number of mixture groups to search for.

So first lets make a small set of fake data to work with. I will make 100 observations with 5 time points. The trajectories are three separate groups (with no zero inflation).

*Make Fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 100.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME OrigData.
*Make 3 fake trajectory profiles.
VECTOR Count(5,F3.0).
DO REPEAT Count = Count1 TO Count5 /#iter = 1 TO 5.
COMPUTE #i = #iter - 3.
COMPUTE #ii  = #i**2.
COMPUTE #iii = #i**3.
  DO IF Id <= 30.
    COMPUTE #P    = 10 + #i*0.3 + #ii*-0.1 + #iii*0.05.
    COMPUTE Count = RV.POISSON(#P).
  ELSE IF Id <=60.
    COMPUTE #P    =  5 + #i*-0.8 + #ii*0.3 + #iii*0.05.
    COMPUTE Count = RV.POISSON(#P).
  ELSE. 
    COMPUTE #P    =  4 + #i*0.8 + #ii*-0.5 + #iii*0.
    COMPUTE Count = RV.POISSON(#P).
  END IF.
END REPEAT.
FORMATS Id Count1 TO Count5 (F3.0).
EXECUTE.

Note The crimCV package wants the data to be wide format for the models, that is each separate time point in its own column. Now we can call R code using BEGIN PROGRAM R to recreate the wide SPSS dataset in an R data frame.

*Recreate SPSS data in R data frame.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS(variables=c("Id","Count1","Count2",
                                                "Count3","Count4","Count5")) #grab data from SPSS
casedataMat <- as.matrix(casedata[,2:6]) #turn into matrix
#summary(casedata) #check contents
#casedataMat[1:10,1:5]
END PROGRAM.

Now to fit one model with 3 groups (without calculating the cross validation statistics) the code would be as simple as:

*Example estimating model with 3 groups and no CV.
BEGIN PROGRAM R.
library(crimCV)
crimCV(casedataMat,3,rcv=FALSE,dpolyp=3,dpolyl=3)
END PROGRAM.

But when we are estimating these group based trajectory models we don’t know the number of groups in advance. So typically one progressively fits more groups and then uses model selection criteria to pick the mixture solution that best fits the data. Here is a loop I created to successively estimate models with more groups and stuffs the models results in a list. It also makes a separate data frame that saves the model fit statistics, so you can see which solution fits the best (at least based on these statistics). Here I loop through estimates of 1 through 4 groups (this takes about 2 minutes in this example). Be warned – here are some bad programming practices in R (the for loops are defensible, but growing the lists within the loop is not – they are small though in my application and I am lazy).

*looping through models 1 through 4.
BEGIN PROGRAM R.
results <- c()  #initializing a set of empty lists to store the seperate models
measures <- data.frame(cbind(groups=c(),llike=c(),AIC=c(),BIC=c(),CVE=c())) #nicer dataframe to check out model 
                                                                            #model selection diagnostics
max <- 4 #this is the number of grouping solutions to check

#looping through models
for (i in 1:max){
    model <- crimCV(casedataMat,i,rcv=TRUE,dpolyp=3,dpolyl=3)
    results <- c(results, list(model))
    measures <- rbind(measures,data.frame(cbind(groups=i,llike=model$llike,
                                          AIC=model$AIC,BIC=model$BIC,CVE=model$cv)))
    #save(measures,results,file=paste0("Traj",as.character(i),".RData")) #save result
    }
#table of the model results
measures
END PROGRAM.

In my actual application the groups take a long time to estimate, so I have the commented line saving the resulting list in a file. Also if the model does not converge it breaks the loop. So here we see that the mixture with 3 groups is the best fit according to the CVE error, but the 2 group solution would be chosen by AIC or BIC criteria. Just for this tutorial I will stick with the 3 group solution. We can plot the predicted trajectories right within R by selecting the nested list.

*plot best fitting model.
BEGIN PROGRAM R.
plot(results[[3]])
#getAnywhere(plot.dmZIPt) #this is the underlying code
END PROGRAM.

Now the particular object that stores the probabilities is within the gwt attribute, so we can transform this to a data frame, merge in the unique identifier, and then use the STATS GET R command to grab the resulting R data frame back into an SPSS dataset.

*Grab probabiltiies back SPSS dataset.
BEGIN PROGRAM R.
myModel <- results[[3]] #grab model
myDF <- data.frame(myModel$gwt) #probabilites into dataframe
myDF$Id <- casedata$Id #add in Id
END PROGRAM.
*back into SPSS dataset.
STATS GET R FILE=* /GET DATAFRAME=myDF DATASET=TrajProb.

Then we can merge this R data frame into SPSS. After that, we can classify the observations into groups based on the maximum posterior probability of belonging to a particular group.

*Merge into original dataset, and the assign groups.
DATASET ACTIVATE OrigData.
MATCH FILES FILE = *
  /FILE = 'TrajProb'
  /BY ID
  /DROP row.names.
DATASET CLOSE TrajProb.
*Assign group based on highest prob.
*If tied last group wins.
VECTOR X = X1 TO X3.
COMPUTE #MaxProb = MAX(X1 TO X3).
LOOP #i = 1 TO 3.
  IF X(#i) = #MaxProb Group = #i.
END LOOP.
FORMATS Group (F1.0).

Part of the motivation for doing this is not only to replicate some of the work of Weisburd and company, but that it has utility for identifying long term hot spots. Part of what Weisburd (and similarly I am) finding is that crime at small places is pretty stable over long periods of time. So we don’t need to make up to date hotspots to allocate police resources, but are probably better off looking at crime over much longer periods to identify places for targeted strategies. Trajectory models are a great tool to help identify those long term high crime places, same as geographic clustering is a great tool to help identify crime hot spots.

15 Comments

by Andy Wheeler on August 12, 2014 • Permalink

Posted in R, SPSS

Tagged clustering, Crime Analysis, data-manipulation, r, SPSS, time-series

Posted by Andy Wheeler on August 12, 2014

https://andrewpwheeler.com/2014/08/12/estimating-group-based-trajectory-models-using-spss-and-r/

New paper – Testing for Randomness in Day of Week Crime Sprees with Small Samples

I’ve uploaded a new paper, Testing for Randomness in Day of Week Crime Sprees with Small Samples, to SSRN. The abstract is below:

This paper discusses exact tests for evaluating whether a series of offences are randomly distributed across days of the week for small sample sizes. The given context is if a crime analyst has identified a series of events that are committed by the same offender(s), can the analyst determine if those events are randomly distributed with respect to the day of week given only a few offences? I detail how one can develop exact reference distributions since the number of potential permutations are small. I also discuss the power of the chi^2 and Kuiper’s V test for small sample sizes, and give an example of hypothesis testing when crimes have uncertain days of occurrence.

Here is an example graph of the power of the tests under different alternate hypotheses of crimes only having probability of being committed on a certain subset of days in the week. Comments on the paper would be wonderful (I will have to bug someone to give me some pre peer-review feedback) – so feel free.

Aggregating values in time series charts

One common task I undertake in is to make time series graphs of crime counts, often over months or shorter time periods. Here is some example data to illustrate, a set of 20 crimes with a particular date in 2013.

*Make some fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP #i = 1 TO 20.
  COMPUTE #R = RV.UNIFORM(0,364).
  COMPUTE DateRob = DATESUM(DATE.MDY(1,1,2013),#R,"DAYS").
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
FORMATS DateRob (ADATE10).
EXECUTE.

SPSS has some convenient functions to aggregate right within GGRAPH, so if I want a chart of the number of crimes per month I can create my own Month variable and aggregate. The pasted GGRAPH code is generated directly though the Chart Builder GUI.

COMPUTE Month = XDATE.MONTH(DateRob).
FORMATS Month (F2.0).

*Default Line chart.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Month COUNT()[name="COUNT"] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Month=col(source(s), name("Month"), unit.category())
  DATA: COUNT=col(source(s), name("COUNT"))
  GUIDE: axis(dim(1), label("Month"))
  GUIDE: axis(dim(2), label("Count"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: line(position(Month*COUNT), missing.wings())
END GPL.

So at first glance that looks alright, but notice that the month’s do not start until 3. Also if you look close you will see a 5 is missing. What happens is that to conduct the aggregation in GGRAPH, SPSS needs to treat Month as a categorical variable – not a continuous one. SPSS only knows of the existence of categories contained in the data. (A similar thing happens in GROUP BY statements in SQL.) So SPSS just omits those categories.

We can manually specify all of the month categories in the axis. To reinforce where the measurements come from I also plot the points on top of the line.

*Line chart with points easier to see.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Month COUNT()[name="COUNT"] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Month=col(source(s), name("Month"), unit.category())
  DATA: COUNT=col(source(s), name("COUNT"))
  GUIDE: axis(dim(1), label("Month"))
  GUIDE: axis(dim(2), label("Count of Robberies"))
  SCALE: cat(dim(1), include("1","2","3","4","5","6","7","8","9","10","11","12"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: line(position(Month*COUNT), missing.wings())
  ELEMENT: point(position(Month*COUNT), color.interior(color.black), 
           color.exterior(color.white), size(size."10"))
END GPL.

So you can see that include statement with all of the month numbers. You can also see what that mysterious missing.wings() function actually does in this example. It is misleading though, as 5 isn’t missing, it is simply zero.

A simple workaround for this example is to just use a bar chart. A zero bar is not misleading.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Month COUNT()[name="COUNT"] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Month=col(source(s), name("Month"), unit.category())
  DATA: COUNT=col(source(s), name("COUNT"))
  GUIDE: axis(dim(1), label("Month"))
  GUIDE: axis(dim(2), label("Count of Robberies"))
  SCALE: cat(dim(1), include("1","2","3","4","5","6","7","8","9","10","11","12"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(Month*COUNT))
END GPL.

I often prefer line charts for several reasons though, often to superimpose multiple lines (e.g. I may want to put the lines for counts of crimes in 2012 and 2011 as well). Line charts are clearly superior to clustered bar charts in that situation. Also I prefer to be able to keep time is a numerical variable in the charts, and one can’t do that with aggregation in GGRAPH.

So I do the aggregation myself.

*Make a new dataset.
DATASET DECLARE AggRob.
AGGREGATE OUTFILE='AggRob'
  /BREAK = Month
  /CountRob = N.
DATASET ACTIVATE AggRob.

But we have the same problem here, in that months with zero counts are not in the data. To fill in the zeroes, I typically make a new dataset of the date ranges using INPUT PROGRAM and loops, same as I did to make the fake data at the beginning of the post.

*Make a new dataset to expand to missing months.
INPUT PROGRAM.
LOOP #i = 1 TO 12.
  COMPUTE Month = #i.
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME TempMonExpan.

Now we can simply merge this expanded dataset back into AggRob, and the recode the system missing values to zero.

*File merge back into AggRob.
DATASET ACTIVATE AggRob.
MATCH FILES FILE = *
  /FILE = 'TempMonExpan'
  /BY Month.
DATASET CLOSE TempMonExpan.
RECODE CountRob (SYSMIS = 0).

Now we can make our nice line chart with the zeros in place.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Month CountRob
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Month=col(source(s), name("Month"))
  DATA: CountRob=col(source(s), name("CountRob"))
  GUIDE: axis(dim(1), label("Month"), delta(1), start(1))
  GUIDE: axis(dim(2), label("Count of Robberies"), start(0))
  SCALE: linear(dim(1), min(1), max(12))
  SCALE: linear(dim(2), min(-0.5))
  ELEMENT: line(position(Month*CountRob))
  ELEMENT: point(position(Month*CountRob), color.interior(color.black), 
           color.exterior(color.white), size(size."10"))
END GPL.

To ease making these separate time series datasets I have made a set of macros, one named !TimeExpand and the other named !DateExpand. Both take a begin and end date and then make an expanded dataset of times. The difference between the two is that !TimeExpand takes a user specified step size, and !DateExpand takes a string of the types used in SPSS date time calculations. The situation in which I like to use !TimeExpand is when I do weekly aggregations from a specified start time (e.g. the weeks don’t start over at the beginning of the year). It also works for irregular times though, say if you wanted 15 minute bins. !DateExpand can take years, quarters, months, weeks, days, hours, minutes, and seconds. The end dates can also be system variables like $TIME as well. The macro can be found here, and it contains several examples within. Update: I have added a few macros that do the same thing for panel data. It just needs to take one numeric variable as the panel id, otherwise the argument for the macros are the same.

1 Comment

by Andy Wheeler on August 4, 2014 • Permalink

Posted in SPSS

Tagged data-manipulation, grammar of graphics, MACRO, SPSS

Posted by Andy Wheeler on August 4, 2014

https://andrewpwheeler.com/2014/08/04/aggregating-values-in-time-series-charts/

Using Google Fusion Tables to make some maps!

In the past to share interactive maps with others I’ve used BatchGeo and CartoDB. BatchGeo is super easy to geocode a few incidents, and CartoDB has a few more stylistic options (including some very cool animations). Both of these projects have a limit on the number of points you can map with the free service though. The new Google maps allows you make similar products to BatchGeo and CartoDB, in that you can upload a csv file or kml and then do some light editing of the points, and then embed an iframe in a website if you want (I wish Google Maps had a time slider like Google Earth does). Here is an example from my PhD of a few locations that one of my original models did a very poor job of predicting the amount of crime at the street midpoint or intersection.

But a few recent projects I wanted to place many more geographies on the map than these free versions allow. ArcGIS online is pretty slick in my few tests, but I am settling on Google Fusion tables for the ability to link the geographies and data tables (plus the ability to filter is very nice). Basically you can upload your data table and kml in seperate Fusion tables and then merge them to create your own polygons with associated data. Here is another example from my dissertation and embedded map below.

Basically what I do is make a set of units of analysis based on street mid-points and intersections. I then divide the city up based on the Thiessen polygons of those sets of points for the allocation of different areal measures. E.g. I can calculate the overlap of the Thiessen polygon with the area of sidewalks.

I’m using Google Fusion tables for some other projects in which I want people within the PD to be able to interactively explore the data. My main interest in these slippy maps are that you can pan and zoom – and with a static map it is hard to recreate all of the potential views a consumer of the map wants. I can typically make a nicer overview map of the forest or any general data patterns in a static map, but if I think the user of the map will want to zoom in to particular locations these interactive maps meet that challenge. Pop-ups allow for a brief digging into the data as well, but don’t allow for visualizing patterns. Fusion tables are very limited though it the styling of the geography. (All of these free versions are pretty limited, but the Fusion tables are especially restrictive for point symbology and creating choropleth classes).

Using these maps has a trade off when sharing with the PD though. They are what I would call semi-public, in that if you want others to be able to view the map you can share a link, but anyone with the link can see the map. This prevents sharing of intimate information on the map that might be possibly leaked. (For the ability to have access control to more sensitive information, e.g. a user has to sign on to a secure website, I know Bair analytics offers paid for products like that – probably some of the prior web map apps I mentioned do so as well.) I’ve made them in the past for Troy P.D., but I really have no idea how often they were used – so other analysts let me know in the comments if you’ve had success with maps like these disseminating info. within the police department.

I’m getting devilishly close to finishing my dissertation, and I will post an update and link when the draft is complete. My prospectus can be seen here, and the linked maps are part of some supplemental material I compiled. The supplemental info. should provide a little more details on what the maps are showing.

Smoothed regression plots for multi-level data

Bruce Weaver on the SPSS Nabble site pointed out that the Centre for Multilevel Modelling has added some syntax files for multilevel modelling for SPSS. I went through the tutorials (in R and Stata) a few years ago and would highly recommend them.

Somehow following the link trail I stumbled on this white paper, Visualising multilevel models; The initial analysis of data, by John Bell and figured it would be good fodder for the the blog. Bell basically shows how using smoothed regression estimates within groups is a good first step in data analysis of complicated multi-level data. I obviously agree, and previously showed how to use ellipses to the same effect. The plots in the Bell whitepaper though are very easy to replicate directly in base SPSS syntax (no extra stat modules or macros required) using just GGRAPH and inline GPL.

For illustration purposes, I will use the same data as I did to illustrate ellipses. It is the popular2.sav sample from Joop Hox’s book. So onto the SPSS code; first we will define a FILE HANDLE for where the popular2.sav data is located and open that file.

FILE HANDLE data /NAME = "!!!!!!Your Handle Here!!!!!".
GET FILE = "data\popular2.sav".
DATASET NAME popular2.

Now, writing the GGRAPH code that will follow is complicated. But we can use the GUI to help us write the most of it and them edit the pasted code to get the plot we want in the end. So, the easiest start to get the graph with the regression lines we want in the end is to navigate to the chart builder menu (Graphs -> Chart Builder), and then create a scatterplot with extrav on the x axis, popular on the y axis, and use class to color the points. The image below is a screen shot of this process, and below that is the GGRAPH code you get when you paste the syntax.

*Base plot created from GUI.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=extrav popular class[LEVEL=NOMINAL] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: extrav=col(source(s), name("extrav"))
  DATA: popular=col(source(s), name("popular"))
  DATA: class=col(source(s), name("class"), unit.category())
  GUIDE: axis(dim(1), label("extraversion"))
  GUIDE: axis(dim(2), label("popularity sociometric score"))
  GUIDE: legend(aesthetic(aesthetic.color.exterior), label("class ident"))
  ELEMENT: point(position(extrav*popular), color.exterior(class))
END GPL.

Now, we aren’t going to generate this chart. With 100 classes, it will be too difficult to identify any differences between classes unless a whole class is an extreme outlier. Here I am going to make several changes to generate the linear regression line of extraversion on popular within each class. To do this we will make some edits to the ELEMENT statement:

replace point with line
replace position(extrav*popular) with position(smooth.linear(extrav*popular)) – this tells SPSS to generate the linear regression line
replace color.exterior(class) with split(class) – the split modifier tells SPSS to generate the regression lines within each class.
make the regression lines semi-transparent by adding in transparency(transparency."0.7")

Extra things I did for aesthetics:

I added jittered points to the plot, and made them small and highly transparent (these really aren’t necessary in the plot and are slightly distracting). Note I placed the points first in the GPL code, so the regression lines are drawn on top of the points.
I changed the FORMATS of extrav and popular to F2.0. SPSS takes the formats for the axis in the charts from the original variables, so this prevents decimal places in the chart (and SPSS intelligently chooses to only label the axes at integer values on its own).
I take out the GUIDE: legend line – it is unneeded since we do not use any colors in the chart.
I change the x and y axis labels, e.g. GUIDE: axis(dim(1), label("Extraversion")) to be title case.

*Updated chart with smooth regression lines.
FORMATS extrav popular (F2.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=extrav popular class[LEVEL=NOMINAL] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: extrav=col(source(s), name("extrav"))
  DATA: popular=col(source(s), name("popular"))
  DATA: class=col(source(s), name("class"), unit.category())
  GUIDE: axis(dim(1), label("Extraversion"))
  GUIDE: axis(dim(2), label("Popularity Sociometric Score"))
  ELEMENT: point.jitter(position(extrav*popular), transparency.exterior(transparency."0.7"), size(size."3"))
  ELEMENT: line(position(smooth.linear(extrav*popular)), split(class), transparency(transparency."0.7"))
END GPL.

So here we can see that the slopes are mostly positive and have intercepts varying mostly between 0 and 6. The slopes are generally positive and (I would guess) around 0.25. There are a few outlier slopes, and given the class sizes do not vary much (most are around 20) we might dig into these outlier locations a bit more to see what is going on. Generally though with 100 classes it doesn’t strike me as very odd as some going against the norm, and a random effects model with varying intercepts and slopes seems reasonable, as well as the assumption that the distribution of slopes are normal. The intercept and slopes probably have a slight negative correlation, but not as much as I would have guessed with a set of scores that are so restricted in this circumstance.

Now the Bell paper has several examples of using the same type of regression lines within groups, but using loess regression estimates to assess non-linearity. This is really simple to update the above plot to incorporate this. One would simply change smooth.linear to smooth.loess. Also SPSS has the ability to estimate quadratic and cubic polynomial terms right within GPL (e.g. smooth.cubic).

Here I will suggest a slightly different chart that allows one to assess how much the linear and non-linear regression lines differ within each class. Instead of super-imposing all of the lines on one plot, I make a small multiple plot where each class gets its own panel. This allows much simpler assessment if any one class shows a clear non-linear trend.

Because we have 100 groups I make the plot bigger using the PAGE command. I make it about as big as can fit on my screen without having to scroll, and make it 1,200^2 pixels. (Also note when you use a PAGE: begin command you need an accompanying PAGE: end() command.
For the small multiples, I wrap the panels by setting COORD: rect(dim(1,2), wrap()).
I strip the x and y axis labels from the plot (simply delete the label options within the GUIDE statements. Space is precious – I don’t want it to be taken up with axis labels and legends.
For the panel label I place the label on top of the panel by setting the opposite option, GUIDE: axis(dim(3), opposite()).

WordPress in the blog post shrinks the graph to fit on the website, but if you open the graph up in a second window you can see how big it is and explore it easier.

*Checking for non-linear trends.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=extrav popular class[LEVEL=NOMINAL] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(1200px,1200px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: extrav=col(source(s), name("extrav"))
  DATA: popular=col(source(s), name("popular"))
  DATA: class=col(source(s), name("class"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2))
  GUIDE: axis(dim(3), opposite())
  ELEMENT: line(position(smooth.linear(extrav*popular*class)), color(color.black))
  ELEMENT: line(position(smooth.loess(extrav*popular*class)), color(color.red))
  PAGE: end()
END GPL.

The graph is complicated, but with some work one can go group by group to see any deviations from the linear regression line. So here we can see that most of the non-linear loess lines are quite similar to the linear line within each class. The only one that strikes me as noteworthy is class 45.

Here there is not much data within classes (around 20 students), so we have to wary of small samples to be estimating these non-linear regression lines. You could generate errors around the linear and polynomial regression lines within GPL, but here I do not do that as it adds a bit of complexity to the plot. But, this is an excellent tool if you have many points within your groups and it can be amenable to quite a large set of panels.

Jittered scatterplots with 0-1 data

Scatterplots with discrete variables and many observations take some touches beyond the defaults to make them useful. Consider the case of a categorical outcome that can only take two values, 0 and 1. What happens when we plot this data against a continuous covariate with my default chart template in SPSS?

Oh boy, that is not helpful. Here is the fake data I made and the GGRAPH code to make said chart.

*Inverse logit - see.
*https://andrewpwheeler.wordpress.com/2013/06/25/an-example-of-using-a-macro-to-make-a-custom-data-transformation-function-in-spss/.
DEFINE !INVLOGIT (!POSITIONAL  !ENCLOSE("(",")") ) 
1/(1 + EXP(-!1))
!ENDDEFINE.

SET SEED 5.
INPUT PROGRAM.
LOOP #i = 1 TO 1000.
  COMPUTE X = RV.UNIFORM(0,1).
  DO IF X <= 0.2.
    COMPUTE YLin = -0.5 + 0.3*(X-0.1) - 4*((X-0.1)**2).
  ELSE IF X > 0.2 AND X < 0.8.
    COMPUTE YLin = 0 - 0.2*(X-0.5) + 2*((X-0.5)**2) - 4*((X-0.5)**3).
  ELSE.
      COMPUTE YLin = 3 + 3*(X - 0.9).
  END IF.
  COMPUTE #YLin = !INVLOGIT(YLin).
  COMPUTE Y = RV.BERNOULLI(#YLin).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME NonLinLogit.
FORMATS Y (F1.0) X (F2.1).

*Original chart.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"))
  ELEMENT: point(position(X*Y))
END GPL.

So here we will do a few things to the chart to make it easier to interpret:

jitter the points slightly on the Y-axis so they don’t overlap
draw the points smaller and slightly transparent

SPSS can jitter the points directly within GGRAPH code (see point.jitter), but here I jitter the data slightly myself a uniform amount. The extra aesthetic options for making points smaller and semi-transparent are at the end of the ELEMENT statement.

*Making a jittered chart.
COMPUTE YJitt = RV.UNIFORM(-0.04,0.04) + Y.
FORMATS Y YJitt (F1.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y YJitt
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: YJitt=col(source(s), name("YJitt"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"), delta(1), start(0))
  SCALE: linear(dim(2), min(-0.05), max(1.05))
  ELEMENT: point(position(X*YJitt), size(size."3"), 
           transparency.exterior(transparency."0.7"))
END GPL.

If I made the Y axis categorical I would need to use point.jitter in the inline GPL code because SPSS will always force the categories to the same spot on the axis. But since I draw the Y axis as continuous here I can do the jittering myself.

A useful tool for exploratory data analysis is to add a smoothing term to plot – a local estimate of the mean at different locations of the X-axis. No binning necessary, here is an example using loess right within the GGRAPH call. The red line is the smoother, and the blue line is the actual proportion I generated the fake data from. It does a pretty good job of identifying the discontinuity at 0.8, but the change points earlier are not visible. Loess was originally meant for continuous data, but for exploratory analysis it works just fine on the 0-1 data here. See also smooth.mean for 0-1 data.

*Now adding in a smoother term.
COMPUTE ActualFunct = !INVLOGIT(YLin).
FORMATS Y YJitt ActualFunct (F2.1).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y YJitt ActualFunct
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: YJitt=col(source(s), name("YJitt"))
  DATA: ActualFunct=col(source(s), name("ActualFunct"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"), delta(0.2), start(0))
  SCALE: linear(dim(2), min(-0.05), max(1.05))
  ELEMENT: point(position(X*YJitt), size(size."3"), 
           transparency.exterior(transparency."0.7"))
  ELEMENT: line(position(smooth.loess(X*Y, proportion(0.2))), color(color.red))
  ELEMENT: line(position(X*ActualFunct), color(color.blue))
END GPL.

SPSS’s default smoothing is alittle too smoothed for my taste, so I set the proportion of the X variable to use in estimating the mean within the position statement.

I wish SPSS had the ability to draw error bars around the smoothed means (you can draw them around the linear regression lines with quadratic or cubic polynomial terms, but not around the local estimates like smooth.loess or smooth.mean). I realize they are not well defined and rarely have coverage properties of typical regression estimators – but I rather have some idea about the error than no idea. Here is an example using the ggplot2 library in R. Of course we can work the magic right within SPSS.

BEGIN PROGRAM R.
#Grab Data
casedata <- spssdata.GetDataFromSPSS(variables=c("Y","X"))
#ggplot smoothed version
library(ggplot2)
library(splines)
MyPlot <- ggplot(aes(x = X, y = Y), data = casedata) + 
          geom_jitter(position = position_jitter(height = .04, width = 0), alpha = 0.1, size = 2) +
          stat_smooth(method="glm", family="binomial", formula = y ~ ns(x,5))
MyPlot
END PROGRAM.

To accomplish the same thing in SPSS you can estimate restricted cubic splines and then use any applicable regression procedure (e.g. LOGISTIC, GENLIN) and save the predicted values and confidence intervals. It is pretty easy to call the R code though!

I haven’t explored the automatic linear modelling, so let me know in the comments if there is a simply way right in SPSS to get explore such non-linear predictions.

3 Comments

by Andy Wheeler on July 3, 2014 • Permalink

Posted in Data Visualization, R, SPSS

Tagged data visualization, ggplot2, r, scatterplot, SPSS

Posted by Andy Wheeler on July 3, 2014

https://andrewpwheeler.com/2014/07/03/jittered-scatterplots-with-0-1-data/

Poisson regression and crazy predictions

Here is a problem I’ve encountered a few times in my own work (and others) with Poisson regression models and the exponential link function. It came up recently in some discussions on the scatterplot blog by Jeremy Freese (see 1 & 2) critiquing the PNAS paper on the effect of female named hurricanes on death tolls, so I figured I would expand up those thoughts a little here.

So the problem is when you estimate a Poisson regression model is that the exponential link function can become explosive for explanatory variables that have a large range. So to be clear, we have a Poisson regression model of the form (here E[Y] mean the expected value of Y):

log(E[Y]) = B1*(X)
    E[Y]  = e^(B1*X)

If X has a small range this may be fine, but if X has a large range it can become problematic. Consider if Y are hurricane deaths, X is monetary damage of the hurricane, and B1 = 0.01. Lets say the monetary damage ranges from 1 to 1000 (imagine these are in thousands of dollars, so range between $1,000 and $1 million). What happens with the predictions?

E[Deaths] = e^(0.01*   1) =     1.01
E[Deaths] = e^(0.01*   5) =     1.05
E[Deaths] = e^(0.01*  10) =     1.11
E[Deaths] = e^(0.01*  50) =     1.65
E[Deaths] = e^(0.01* 100) =     2.71
E[Deaths] = e^(0.01* 500) =   148
E[Deaths] = e^(0.01*1000) = 22026

These predictions are invariant to linear transformations – that is Z-scoring X in the original units doesn’t change the predictions (the same as expressing X in [dollars/1000] doesn’t make any difference than just by including [X] on the right hand side). The linear predictor of B1 will simply be scaled by the appropriate inverse transformation. Also I’d note that expressed in terms of incident rate ratios the effect would be e^0.01=1.01. This appears on its face a totally innocuous effect, and only in consideration of the variation in X does it appear to be absurd.

You can see that if the range of X were smaller, say between 1 and 100, the predictions might be fine. The predictions between 1 and 100 only vary by 1.7 deaths. The problem with these explosive predictions at larger values is that they are nonsense for most of social scientific research. A simple sanity check to see if this is occurring is to check the predicted value from your Poisson regression equation at the low end of X versus the high end (and just pretend all of the other explanatory variables are set to 0) exactly as I have done here. If the high end is crazy, you will need to consider some alternative model specification (or be very clear that the model can not be extrapolated to the larger values of X).

A useful alternate parametrization is simply to log X, and in this case when exponentiating the right hand side, it will make the predictor a power of the original metric.

So imagine we fit the model:

log(E[Y]) = B2*(log(X))
    E[Y]) = e^(B2*log(X)) 
          = x^B2

Lets say here that B2 = 0.5. What happens to our predictions again?

E[Deaths] =    1^0.5 =  1
E[Deaths] =    5^0.5 =  2.2
E[Deaths] =   10^0.5 =  3.2
E[Deaths] =   50^0.5 =  7.1
E[Deaths] =  100^0.5 = 10
E[Deaths] =  500^0.5 = 22
E[Deaths] = 1000^0.5 = 32

Those predictions look a little bit easier to swallow at the larger ranges. Notice also the differences in predictions in the smaller stages? There is more discrimination for the smaller values than on the original scale, but the larger values are suppressed. Lets consider the predictions side by side for easier comparison.

   X      B1   B2
   -      --   --
   1       1    1
   5       1    2
  10       1    3
  50       2    7
 100       3   10
 500     148   22
1000   22026   32

A frequent problem with logging the explanatory variables is that they contain zeroes. A simple alternative is to treat log(0) as 0 and then have a separate dummy variable equal to 1 when X = 0. This model may not make Occam happy, as it implies a discontinuity at 0, but it is in my opinion a small price to pay. Also if there are a lot of zeroes this doesn’t strike me as totally unrealistic to have a mixture of what happens at 0 and then what happens at the higher values. So the full model written out would be:

log(E[Y]) = B3*D + B4*(log(X))

But the model is essentially discontinuous. When X=0, we treat log(X)=0 and D=1, so the model reduces to;

log(E[Y]) = B3*D :When X = 0

When X>0, D=0 and the model reduces to:

log(E[Y]) = B4*(log(X)) :When X > 0

Now, it certainly would be weird if B3>>0, as this would imply a high spike at 0, and then at the X value of 1 Y goes back down to 1 and then increases with X. If we expect B4 to be positive, then a negative value of B3 (or very close to 0) would make the most sense. It is still a discontinuity in the function, but one that may make theoretical sense. So imagine we fit the equation log(E[Y]) = B3*D + B4*(log(X)), lets say B4 is 0.5 (the same as B2), and that B3 is equal to -0.1. This would then make the set of predictions go:

E[Deaths] =  e^-0.01 =  0.9
E[Deaths] =    1^0.5 =  1.0
E[Deaths] =    5^0.5 =  2.2
E[Deaths] =   10^0.5 =  3.2
E[Deaths] =   50^0.5 =  7.1
E[Deaths] =  100^0.5 = 10
E[Deaths] =  500^0.5 = 22
E[Deaths] = 1000^0.5 = 32

So in this made up example the discontinuity pretty much fits right in with the rest of the function. We may consider other non-linear transformations of X as well (splines or higher powers) but frequently an additional problem is that the bulk of the data lie in the lower end of the range. So for our dollars if it was highly right skewed, there may only be a few values at 100 or higher. These can be highly influential if you use powers of X (e.g. include X^2, X^3 etc. on the right hand side) – so splines are a better choice – but essentially no matter how you fit the function it will be hard to verify the fit at these values or extrapolate to those tails. So the fit in the original function may be fine – but it just implies unrealistic marginal effects in the tails.

So how do we verify one equation over the other? Visualizing count data in scatterplots tend to be harder than visualizing continuous data, especially if there is a stock pile of data at 0. The problem is simply exacerbated if the explanatory variable has a similar right skew, there will be a large mass near the origin of the plot and very sparse everywhere else.

My simple suggesting is to just bin the data at X values, which is very simple if X is integer valued, and then plot the mean and standard error of the Y value within those bins. As Poisson regression and its variants rely on asymptotic properties, if the error bars are too variable to deduce a pattern you should be concerned your sample size isn’t large enough to begin with.

If the bulk of the data only have a small range over X, then it will be hard in practice to differentiate between the two model parametrizations I suggest here (with the typical noisy data we have in the social sciences). So you may prefer logging the X variable simply to prevent the dramatic explosions in the tails of the data right from the start.

I do feel comfortable saying that if the ratio of the smallest to largest value for the independent variable is over 100, you should check the predictions of the exponential link function very closely (if the smallest value is 0 just estimate the ratio as if the smallest value is 1). If this ratio is 100 or larger, unless the linear predictor in the Poisson regression equation is very small (<<.01) the predictions may explode into very implausible ranges for the larger X values.

2 Comments

by Andy Wheeler on June 17, 2014 • Permalink

Posted in Regression, scholarly

Tagged Poisson, scholarly

Posted by Andy Wheeler on June 17, 2014

https://andrewpwheeler.com/2014/06/17/poisson-regression-and-crazy-predictions/

What is up with 3d graphics for book covers?

The other day in Google books I noticed Graphics for Statistics and Data Analysis with R by Kevin Keen in the related book section. What caught my eye was not the title (there have to be 100+ related R books at this point) but the really awful 3d pie chart.

Looking at the preview on google books this appears to be an unfortunate substitution. The actual cover has a much more reasonable set of surface plots and other online book stores (e.g. Amazon) appear to have the correct cover.

I suspect someone at CRC Press used some stock imagery for the cover, and unfortunately the weird 3d pie graph has been propagated to the google book preview without correction.

This reminded me of a few other book covers in cartography and data visualization though that I find less than appealing. Now, I’m not saying here to judge a book by its cover, and I have not read all of the books I will point to here. But I find the use of 3d graphics in book covers in the data visualization field to be strange and bordering cognitive dissonance with the advice most of the authors give.

First I’ll start with a book I have read, and would suggest to everyone, Thematic cartography and geographic visualization by Slocum et al. I have the 2005 version, and it is dawned by this 3d landscape. (Sorry this is the largest image I can find online – other editions I believe have different covers.)

The multivariate display of data is admirable – so for exploratory analysis you could make a reasonable argument for the use of proportional sized circles superimposed on the choropleth map. The use of 3d in this circumstance though is gratuitous, and the extreme perspective hides much of the data while highlighting the green hills in the background.

The second mapping book I have slight reservations about critiquing the cover (I am on the job market!). I have not read the book, so I can not say anything about its contents. But roaming the book displays at an ASC conference I remember seeing this cover, GIS and Spatial Analysis for the Social Sciences: Coding, Mapping, and Modeling by Nash and Asencio.

This probably should not count in the other 3d graphics I am showing. The bar columns do have shading for 3d perspective – but the map otherwise is 2d. But the spectral color scheme is an awful choice. The red in the map tends to stand out the most – which places with zero crimes I don’t think you want to make that impression. The choropleth colors appear to be displaying the same data as the point data. The point data are so clustered that the choropleth can only be described as misleading – which may be a good point in text for side by side maps – but on the cover? Bar locations seem to be unrelated (as we might expect for juvenile crime) but they are again aggregated to the (probably) census units – making me question if the aggregation obfuscates the relationship. Bars are not available from the census – so it is likely this aggregation was intentional. I have no idea about the content of the book and I will likely get it and do an overall review of all crime mapping books sometime. But the cover is unambiguously a bad map.

The last book cover with 3d graphics (related to data-visualization) that I immediately remembered was R For Dummies by Meys and de Vries.

Now this when you look close really is not bad. It is not a graph on the cover, but a set of winding, hexagon cylinder stairsteps. So the analogy of taking small steps is fine – but the visual similarity to other statistical 3d graphics is clear. Consider the SPSS For Dummies book by Griffith.

Now that is an intentional, 3d chart made up of tiny blocks, with a trend line if you look closely, shadowed by cigarette like red bars in the background. At least this is so strange (and not possible in statistical software) that this example would never be confused with an actual reasonable statistical graphic. The Dummies series has such brand recognition as well that the dominant part of the cover might be the iconic yellow and type, as opposed to the inset graphic.

Not wanting to leave other software out of the loop, I looked for examples for SAS and Stata. SAS has a reasonable 3d cover in SAS System for Statistical Graphics by Friendly.

Short sidetrack story: I first learned statistical programming using SAS back in undergrad days at Bloomsburg University. Default graphics for SAS at that point (04-08) I believe were still the ASCII art looking things (at least that is what I remember). During our last meeting for my last statistics class – one of the other students showed me you could turn on the ODS output to html – and tables and graphs were by default pretty nice. I since have not had a need to use SAS.

This 3d cover by Friendly is arguably a reasonable use of 3d. 3d graphs are hard to navigate, and the use the anchors connecting the observations to the non-linear surface more easily associate a point with below or above the surface. It is certainly difficult though to understand the fit of the function – so likely a series of bivariate graphs would be more intuitive – especially given the meager number of points. I suspect the 3d on the cover is for the same reason 3d graphics were used in the other covers – because it looks cooler to book marketers!

Stata managed to debunk the 3d graph trends – I could not find any example Stata books with 3d graphics. Nick Cox’s newer collection of his Speaking Stata series though has some interesting embellishments.

While in isolation all of the graphs are fine – I’m sure Cox would not endorse the gratuitous use of color gradients in the graphics (I don’t think svg like gradients like that are even possible in Stata graphics). The ternary diagrams show nothing but triangles as well – so I don’t think such gradients are a good idea in any case for simply the background of the plot. Such embellishments could actually decode data, but in the case of bar graphs do not likely hurt or help with understanding the plot. When such gradients are used as the background though they likely compete with the actual data in the plot. Stata apparently can do 3d graphs – so I might suggest I write a book on crime modelling (published by Stata press) and insert a 3d graph on the cover (as this is clearly a niche in the market not currently filled!) I might have to make room for Chernoff faces somewhere on the front or back cover as well.

So maybe I am just seeing things in the examples of 3d covers. If anyone has any insight into how these publishers choose the covers let me know – or if you have other examples of bad book cover examples of data vizualization! Since most of my maps and graphs are pretty dull in 2d I might just outsource the graphic design if I made a book.

1 Comment

by Andy Wheeler on June 16, 2014 • Permalink

Posted in Data Visualization, Mapping, scholarly

Tagged cartography, data visualization, mapping, scholarly

Posted by Andy Wheeler on June 16, 2014

https://andrewpwheeler.com/2014/06/16/what-is-up-with-3d-graphics-for-book-covers/

Learning to write badly in the social sciences

I recently finished Michael Billig’s book, Learn to Write Badly: How to Succeed in the Social Sciences, and I was largely convinced of Billig’s thesis so I shall reiterate it briefly here. Billig argues that much of social scientific writing is difficult to understand because of the excessive use of nouns instead of verbs. If we (ironically) use the word nominalization to describe the process of turning verbs into nouns, it would sound pretty similar to old hat of avoiding the use of jargon in scientific writing.

Billig goes a bit further though then the usual avoid jargon advice (which is uncontroversial), but gives many examples of where this change from verbs to nouns has negative consequences on how writing is interpreted. Two of these are:

Verbs are often much more clear about what actor is performing what actions (and in turning the verb to a noun both the actor and the action can become ambigious)
Replacing verbs with nouns gives a false sense of authority that the noun actually exists

The first is important for social scientists because we are pretty much always describing the actions of humans. The second Billig likens to marketing strategies to promote ones work (similar to how advertisements promote products), which I imagine the analogy turns a few academics stomachs.

As an example from my own work, I will use the title to one of my papers, The Moving Home Effect: A Quasi Experiment Assessing Effect of Home Location on the Offence Location. The title is really awful, and in a bit of self-deprecation the few times I presented the work I would make fun of my title making skills at the opening of my talk. The first part of the title before the semi-colon, "The Moving Home Effect", is an example of an ambiguous use of nouns. First, describing my findings as the moving home effect is rather ambiguous, it could mean effecting anything and everything. My particular study is much more restricted, I examined the distance between crimes before and after offenders move. Second, the effect of the added distance between crimes when moving I found to be rather small, which is probably one of the more interesting points of the paper. So saying "The Moving Home Effect" places unwanted emphasis that it exists and is real.

The same exercise can be used for the second part of the title. The use of quasi-experiment is simply econometrics jargon and is unneeded. It is an appeal to associate with a particular camp of analysis, and really does nothing to describe the nature of the work. So, I propose possible rewrites of my title to be:

Moving one’s home slightly changes the average distance between offences
When an offender moves, it slightly changes the location of where they offend

These are much more descriptive (and shorter) than the original title. Reviewing my own work such examples are rampant, so I have a bit of work to do to live up to Billig’s ideal, but I am convinced it will lead to improved writing. Also it seems to me it is a good exercise to make ones writing more concise, which is always welcome.

1 Comment

by Andy Wheeler on June 10, 2014 • Permalink

Posted in scholarly

Tagged scholarly, writing

Posted by Andy Wheeler on June 10, 2014

https://andrewpwheeler.com/2014/06/10/learning-to-write-badly-in-the-social-sciences/

In (slight) defense of word clouds

Word cloud of the frequency of words in Dr. Seuss’s Cat in the Hat via Jason Davies D3.js app.

I try to view the practice of data visualization with an open mind. In particular, I like to think that there is rarely a strict dichotomy of good or bad visualizations (or synonymously maps). It is a continual gradation, but we can use knowledge of visual perception to guide us as to visualizations that will be better for communicating particular information. Here I want to spend a few minutes talking about word clouds, and provide some things that they do good along with the bad.

So first, I will be focusing solely on the presentation of data in the form of the word cloud. Jacob Harris has a wonderful rant about why he hates word clouds, and this is focused on the fact that many word clouds are devoid of real meaning. You can present meaningless content in any graphical form you want – so I will not discuss this further.

When we evaluate the utility of any particular graphic, one needs to be clear about the motivation for the graphic – what data the graphic is intended to display to its audience. With that goal in mind, then we can evaluate how well the graphic meets that goal. If the goal of a word cloud is intended as a graphical depiction of the entire distribution of words it is obviously inferior to a bar chart (or with a long tail a chart of the empirical CDF). The words in a word cloud tend to be all jumbled up and in no particular order, so it is difficult to see the distribution. If the goal is a quantitative estimate of the difference between two words, they also fail here because we know that area judgements are much more difficult than comparing lengths along an aligned axis. In addition, there is another confound in that the length of the word (or even particular letters with ascenders or descenders – depending on the font used) further confound the size differences between the words in the word cloud. That is, even if the font sizes are in the end the same, SomeReallyBigLongWord will appear larger in the graphic than a word such as tiny. Again bar charts are clearly superior for either of these tasks, and Marti Hearst has a nice PDF white paper discussing these short comings (via Stephen Few’s Perceptual Edge site), Whats up with Tag clounds. Also see Stephen Few’s critique of the use of packed circles in Tableu for a similar critique.

Given the shortcomings, there are two potential aspects of word clouds that I personally find appealing in terms of communicating information. They both have to do with focusing the attention on particular words, as opposed to focusing on the empirical distribution of words or on the size difference between the two words. The first aspect I would refer to the Stroop effect. Stroop’s experiment showed that reaction time for reading words that provided interfering mental stimulus slowed reaction time. Or more simply, it is easier to read red and blue than it is to read green and orange. So what does this have to do tag clouds? In tag clouds, the differentially sized words themselves are the data. There is no cognitive burden as in bar charts because one need not navigate between the label and the word.

The second aspect is related to the fact that in a particular graphic, there are certain elements that will dominate the readers attention. This is similar to creating a layering in a chart (that is often discussed in cartography), e.g. make elements you want to focus attention on in the chart come to the foreground, and other elements recede to the background. For one example Stephen Kosslyn argues that bar charts are preferable to Cleveland dot plots for presentations, as the bars have a higher visual recognition (e.g. stand out more or are more salient). Word clouds do this very well for the bigger words, while bar charts tend to be more appropriate for visualizing the entire distribution. That is the bigger words aren’t immediately more salient in the graph – the actual bars displaying the data are.

Based on these, I would suspect a user study of word clouds would outperform bar charts in the following aspects in terms of time taken to perform the task:

Return the top three words in the graph.
Given two particular words, which word is larger than the other word.

Or more simply, even though when evaluating areas we are not terribly effective at assigning areas, we can still rank order based on the size of the words. So in terms of rank order tasks I think word clouds will perform alright for simple comparisons. Given no time constraints, I would expect bar charts will always be more accurate, but depending on the nature of the task will take longer. For the top three words, if the chart is ordered one would need to read the top three and then reply. If the chart was unordered, one needs to take time to find them (this is where the interference and the Stroop effect comes into play). Most words clouds I see these top words are pretty much immediately recognizable, and so I suspect will be much faster for these tasks.

For the two particular words, I suspect it will boil down to in what format you can find the words the fastest. In the bar chart, you have to scan a list, whereas in a tag cloud it is unordered. With the exception of very tiny words in the word cloud, I bet you will find each word faster in a word cloud in most examples. (This is my guess, I know of no experiments confirming this specific scenario.)

So what does this mean in terms of visualizing data – either in the form of a tag/word cloud or a more typical form like a bar chart? For the bar chart, the bar is what is the most salient feature of the graph. If you want to draw more attention to the particular words, while still leaving the data intact, consider direct labels of the bars or maybe a more table like graphic. As always, if you can selectively filter out a larger amount of words, it will provide a simpler graphic to evaluate for the remaining items.

Can the tag cloud ever be a reasonable data visualization compared to a simple table or bar graph? I believe it can, but only if the typical layout algorithms are improved to incorporate ancillary information. Tag clouds are very similar to bin packing algorithms, and strike me as natively similar to force directed network layout algorithms (such as Dorling Cartograms). The layouts as used by Wordle or the Jason Davies implementation simply place the words by a convenient algorithm in a rank order importance. They pretty much just take the biggest word, place it, then go around in a circle to find a place for the next biggest word.

We can think of other ways to lay out the words though. CiteULike has a wonderful example where the words are laid out in alphabetical order, you can see my entire library tags here. Here is a screenshot of my tag library on the homepage, in which it is a smaller space, so they filter out the small tags:

The motivation for this is not to view the empirical distribution of the tags, but is for look up of particular tags. You can click on a tag and it brings up all of my entries in my library with that tag. If you don’t want to navigate to an additional site, simply take a look to the right aside of this WordPress template I use. There is a library of the tags I use on this site ordered alphabetically and filtered, but with the font sized in proportion to the amount of tags used in posts on this site. I find these uses of word clouds quite convenient, where I rather give priority to the words themselves, over specific quantitative estimates of the frequencies.

There are other potential ways to layout the words though that can hold other quantitative information. For instance, I could layout the tags in my CiteULike library according to shared articles. Coloring the words and the typical layouts in random order I find distracting, but given particular tasks (such as simple look up) I find they work just fine. Again because the focus is on the word, and not exact quantitative assessment of the magnitude (nor understanding of the overall distribution) I say tag clouds are better in this particular circumstance than a bar chart is. My arguments based on cognitive load and speed of surveying word clouds certainly does not make them appropriate for all data viz. tasks, but I believe it does for some.

1 Comment

by Andy Wheeler on May 22, 2014 • Permalink

Posted in Data Visualization

Tagged data visualization, word-cloud

Posted by Andy Wheeler on May 22, 2014

https://andrewpwheeler.com/2014/05/22/in-slight-defense-of-word-clouds/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 399 other subscribers
aoristic big-data cartography census choropleth citeulike color consulting cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning ESRI excel flow-data folium geocoding github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing linear programming logistic-regression machine-learning MACRO mapping matplotlib meta network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly scraping seaborn shootings simulation small-multiples social-media social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping
Top Posts & Pages
Stack Exchange

All posts by Andy Wheeler

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange