Bean plots in SPSS

It seems like I have come across alot of posts recently about visualizing univariate distributions. Besides my own recent blog post about comparing distributions of unequal size in SPSS, here are a few other blog posts I have recently come across;

Nathan Yau has a recent post on his flowing data blog, How to Visualize and Compare Distributions.
Lucas Fenu on his blog gRaphics! has a post, Plotting data and distribution simultaneously (with ggplot2).
Not a blog post, but Hadley Wickham & Lisa Stryjewski have a working paper that will be of interest, 40 years of box plots.

Such a variety of references is not surprising though. Examining univariate distributions is a regular task for data analysis and can tell you alot about the nature of data (including potential errors in the data). Here are some posts on the Cross Validated Q/A site of related interest I have compiled;

In particular the recent post on bean plots and Luca Fenu’s post motivated my playing around with SPSS to produce the bean plots here. Note Jon Peck has published a graphboard template to generate violin plots for SPSS, but here I will show how to generate them in the usual GGRAPH commands. It is actually pretty easy, and here I extend the violin plots to include the beans suggested in bean plots!

A brief bit about the motivation for bean plots. Besides consulting the article by Peter Kampstra, one is interested in viewing a univariate continuous distribution among a set of different categories. To do this one uses a smoothed kernel density estimate of the distribution for each of the subgroups. When viewing the smoothed distribution though one loses the ability to identify patterns in the individual data points. Patterns can mean many things, such as outliers, or patterns such as striation within the main body of observations. The bean plot article gives an example where striation in measurements at specific inches can be seen. Another example might be examining the time of reported crime incidents (they will have bunches at the beginning of the hour, as well as 15, 30, & 45 minute marks).

Below I will go through a brief series of examples demonstrating how to make bean plots in SPSS.

SPSS code to make bean plots

First I will make some fake data for us to work with.

******************************************.
set seed = 10.
input program.
loop #i = 1 to 1000.
compute V1 = RV.NORM(0,1).
compute groups = TRUNC(RV.UNIFORM(0,5)).
end case.
end loop.
end file.
end input program.
dataset name sim.
execute.

value labels groups
0 'cat 0'
1 'cat 1'
2 'cat 2'
3 'cat 3'
4 'cat 4'.
******************************************.

Next, I will show some code to make the two plots below. These are typical kernel density estimates of the V1 variable I made for the entire distribution, and these are to show the elements of the base bean plots. Note the use of the TRANS statement in the GPL to make a constant value to plot the rug of the distribution. Also note although such rugs are typically shown as bars, you could pretty much always use point markers as well in any situation where you use bars. Below the image is the GGRAPH code used to produce them.

******************************************.
*Regular density estimate with rug plot.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V1 MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: V1=col(source(s), name("V1"))
  TRANS: rug = eval(-26)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), label("Density"))
  SCALE: linear(dim(2), min(-30))
  ELEMENT: interval(position(V1*rug), transparency.exterior(transparency."0.8"))
  ELEMENT: line(position(density.kernel.epanechnikov(V1*1)))
END GPL.

*Density estimate with points instead of bars for rug.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V1 MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: V1=col(source(s), name("V1"))
  TRANS: rug = eval(-15)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), label("Density"))
  SCALE: linear(dim(2), min(-30))
  ELEMENT: point(position(V1*rug), transparency.exterior(transparency."0.8"))
  ELEMENT: line(position(density.kernel.epanechnikov(V1*1)))
END GPL.
******************************************.

Now bean plots are just the above plots rotatated 90 degrees, adding a reflection of the distribution (so the area of the density is represented in two dimensions), and then further paneled by another categorical variable. To do the reflection, one has to create a fake variable equal to the first variable used for the density estimate. But after that, it is just knowing alittle GGRAPH magic to make the plots.

******************************************.
compute V2 = V1.

varstocases
/make V from V1 V2
/index panel_dum.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V panel_dum groups MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  COORD: transpose(mirror(rect(dim(1,2))))
  DATA: V=col(source(s), name("V"))
  DATA: panel_dum=col(source(s), name("panel_dum"), unit.category())
  DATA: groups=col(source(s), name("groups"), unit.category())
  TRANS: zero = eval(10)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), null())
  GUIDE: axis(dim(3), null())
  SCALE: linear(dim(2), min(0))
  ELEMENT: area(position(density.kernel.epanechnikov(V*1*panel_dum*1*groups)), transparency.exterior(transparency."1.0"), transparency.interior(transparency."0.4"), 
           color.interior(color.grey), color.exterior(color.grey)))
  ELEMENT: interval(position(V*zero*panel_dum*1*groups), transparency.exterior(transparency."0.8"))
END GPL.
    ******************************************.

Note I did not label the density estimate anymore. I could have, but I would have had to essentially divide the density estimate by two, since I am showing it twice (which is possible, and if you wanted to show it you would omit the GUIDE: axis(dim(2), null()) command). But even without the axis they are still reasonable for relative comparisons. Also note the COORD statement for how I get the panels to mirror each other (the transpose statement just switches the X and Y axis in the charts).

I just post hoc edited the chart to get it to look nice (in particular settign the spacing between the panel_dum panel to zero and making the panel outlines transparent), but most of those things can likley be more steamlined by making an appropriate chart template. Two things I do not like, which I may need to edit the chart template to be able to accomplish anyway; 1) There is an artifact of a white line running down the density estimates, (it is hard to see with the rug, but closer inspection will show it), 2) I would prefer to have a box around all of the estimates and categories, but to prevent a streak running down the middle of the density estimates one needs to draw the panel boxes without borders. To see if I can accomplish these things will take further investigation.

This framework is easily extended to the case where you don’t want a reflection of the same variable, but want to plot the continuous distribution estimate of a second variable. Below is an example, and here I have posted the syntax in entirety used in making this post. In there I also have an example of weighting groups inversely proportional to the total items in each group, which should make the area of each group equal.

In this example of comparing groups, I utilize dots instead of the bar rug, as I believe it provides more contrast between the two distributions. Also note in general I have not superimposed other summary statistics (some of the bean plots have quartile lines super-imposed). You could do this, but it gets a bit busy.

11 Comments

by Andy Wheeler on May 20, 2012 • Permalink

Posted in Data Visualization, SPSS

Tagged data visualization, small-multiples, SPSS

Posted by Andy Wheeler on May 20, 2012

https://andrewpwheeler.com/2012/05/20/bean-plots-in-spss/

11 Comments

Louise S.

/ June 22, 2012

Hi Andrew,
Thanks for posting the syntax – it was really useful to follow the steps you took to make these. I’m just wondering how you could add a coloured point to distinguish the median in each group- would it be something like this:
ELEMENT: point(position(summary.mean(V*1*panel_dum*1*groups))?
thanks in advance for your help!
louise

apwheele

/ June 25, 2012

Hi Louise,

In theory, something like ELEMENT: point(position(summary.median(V*1*panel_dum*1*groups)), color.exterior(panel_dum), shape("Median"), size(size.large)) should work, although my quick attempts to get it to act as desired were unsucessful. I also tried to make an actual summary variable via the TRANS command in inline GPL and that did not work either. So what I ended up doing was making a new variable and plotting that in its own element statement. Note when you have multiple elements like this the legend gets a bit un-wieldy, and what I have been doing is editing post-hoc in a different vector editor to get the legend to how I want it. Part of the problem I think is that SPSS does not like mapping the same aesthetic to different types of elements, so sometimes it gives errors when trying to construct the legend.

Below is an example using the same data that is in the last plot in my blog post.


AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=groups panel_dum
  /V_median=MEDIAN(V).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V V_median panel_dum groups MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  COORD: transpose(mirror(rect(dim(1,2))))
  DATA: V=col(source(s), name("V"))
  DATA: V_median=col(source(s), name("V_median"))
  DATA: panel_dum=col(source(s), name("panel_dum"), unit.category())
  DATA: groups=col(source(s), name("groups"), unit.category())
  TRANS: zero = eval(20)
  TRANS: med_point = eval(40)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), label("Frequency"))
  GUIDE: axis(dim(3), null())
  GUIDE: legend(aesthetic(aesthetic.color.exterior))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  GUIDE: legend(aesthetic(aesthetic.shape.interior), null())
  GUIDE: legend(aesthetic(aesthetic.transparency.exterior), null())
  GUIDE: legend(aesthetic(aesthetic.transparency.interior), null())
  GUIDE: legend(aesthetic(aesthetic.size), null())
  SCALE: linear(dim(2), min(0))
  SCALE: cat(aesthetic(aesthetic.shape.interior), map(("Median", shape.square), ("Rug", shape.circle)))
  SCALE: cat(aesthetic(aesthetic.transparency), map(("Median", transparency."0.0"), ("Rug", transparency."0.9")))
  SCALE: cat(aesthetic(aesthetic.size), map(("Median", size."8"), ("Rug", size."4")))
  ELEMENT: point(position(V*zero*panel_dum*1*groups), color.exterior(panel_dum), color.interior(panel_dum), transparency.interior("Rug"), transparency.exterior("Rug"), 
                           shape.interior("Rug"), size("Rug"))
  ELEMENT: point(position(V_median*zero*panel_dum*1*groups), color.exterior(panel_dum), color.interior(panel_dum), transparency.interior("Median"), transparency.exterior("Median"), 
                           shape.interior("Median"), size("Median"))
  ELEMENT: area(position(density.kernel.epanechnikov(V*1*panel_dum*1*groups)), transparency.exterior(transparency."1.0")), color(panel_dum),  transparency.interior(transparency."0.5")))
END GPL.

Thanks for the comment and if you have any other questions let me know here in the comments or feel free to shoot me an email.

Andy

Anton Balabanov
/ January 23, 2015

Hi, Andrew! Thanks for excellent examples of violin plots in SPSS. A bit off-topic, do you happen to know, how favourable are editors of scientific journals for such illustrations? I mean I didn’t see much papers using violins in particular in the domain of sociology or medicine. To some they might look not so “strict” as, say, boxplots: the width of violin changes, but there is no explicit axis underlying. Just interesting in your opinion/experience.
Anton

Reply
- apwheele
  / January 23, 2015
  
  It is a good question Anton. I have not used plots like these, but more typical histograms in publications. I would personally be ok with them, but I can’t speak to everyone. (It would not surprise me to see blow back from using them in a publication — but I rather get a critique of that than many other things.)
  
  If they do a good job showing differences that would be masked with summary statistics or other graphs I think they have a good argument for there use. Also see Error Bars Considered Harmful: Exploring Alternate Encodings for Mean and Error (Correll & Gleicher, 2014), for some experimental evidence that they are more effective for showing statistical tests (and not just kde’s of the original data).
  
  Reply
  - Anton Balabanov
    / January 27, 2015
    
    Thanks, Andrew, and also thanks for the link. Interesting insights on within-the-bar bias and overestimating differences with barchart error margins. Personally I liked the idea to show errors as gradually disappearing bars around statistic.
Andrew McLaren
/ November 29, 2024

Hi Andrew,

Great website and syntax. A question I was hoping you could assist with. When I bring your syntax across, change the variable names and press run I keep getting the following error:

>Error #99. Command name: BEGIN GPL
>This command is only valid immediately following the GGRAPH procedure
>Execution of this command stops

Not sure what I’m doing wrong as the Syntax looks all good when entered.

Assuming I;m in the right place in the sense of File > New > Syntax and then paste your Syntax

Any thoughts would be most appreciated!

Kind regards,

Andy McLaren

Reply
- apwheele
  / November 30, 2024
  
  So this error can be a bit of a red herring, it signals that there is likely something wrong with the command *prior* to BEGIN GPL, so the GGRAPH part may be in error. Also make sure to run the entire block from GGRAPH to END GPL at once.
  
  It is easier if you share your entire code, and I can give a quick look to see if I spot any errors.
  
  Reply
Andrew McLaren
/ November 30, 2024

Hi Andrew,

No problems. My variable is “MLDmm” (maximum linear dimension)

GGRAPH
/GRAPHDATASET NAME=”graphdataset” VARIABLES=MLDmm MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id(“graphdataset”))
DATA: MLDmm=col(source(s), name(“MLDmm”))
GUIDE: axis(dim(1), label(“MLDmm”))
GUIDE: axis(dim(2), label(“Density”))
SCALE: linear(dim(2), min(-30))
ELEMENT: line(position(density.kernel.epanechnikov(MLDmm*1)))
END GPL.

Reply
- Andrew McLaren
  / December 6, 2024
  
  Just following this up Andrew, thanks
  
  Reply
- apwheele
  / December 7, 2024
  
  Sorry, I do not see any obvious problems with the code (and do not have a current SPSS license to test quickly on my end). Only other piece of advice is to make sure you are selecting the entire block when running that code.
  
  Reply

The baseline in histograms & outliers | Andrew Wheeler

Andrew Wheeler

Bean plots in SPSS

SPSS code to make bean plots

11 Comments

Louise S.

apwheele

Anton Balabanov

apwheele

Anton Balabanov

Andrew McLaren

apwheele

Andrew McLaren

Andrew McLaren

apwheele

Leave a comment Cancel reply

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange

Andrew Wheeler

Bean plots in SPSS

SPSS code to make bean plots

Share this:

Related

11 Comments

Louise S.

apwheele

Anton Balabanov

apwheele

Anton Balabanov

Andrew McLaren

apwheele

Andrew McLaren

Andrew McLaren

apwheele

Leave a comment Cancel reply

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange