# Extracting items from SPSS tables using Python

Sometimes SPSS tables contain statistics that you need for further calculations. A frequent one is grabbing certain percentiles from a `FREQUENCIES` table (equal-probability histograms in SPSS are one example). The typical way to do this is to capture the table using OMS, but that is overkill if you just need to merge the info back into the original data for further calculations. I will show a brief example of grabbing the 25th, 50th, and 75th percentiles from a frequency table and using Python to calculate a robust standardized variable from these summary statistics.

First we will make a set of random data to work with.

``````SET SEED 10.
MATRIX.
SAVE {UNIFORM(100,1)} /OUTFILE = *.
END MATRIX.
DATASET NAME U.
FREQ COL1 /FORMAT = NOTABLE /PERCENTILES = 25 50 75.``````

The frequency table we are working with is then the statistics table this `FREQ` command produces.

Now, to get to the items in this frequency table we just have to go down a bit of a rabbit hole of different Python objects.

• The first block grabs the items in the output, which include tables and text.
• The second block then grabs the last table for this specific output. Subtracting 2 from the size of the list is necessary because Python uses zero-based indices and there is a log item after the table. So if the size of the list is 10, `list[9]` is the log item and `list[8]` is the table. (Using negative indices does not work for extracting from the OutputItemList object.)
• The third part then grabs the quantiles from the indices of the table. It ends up being in the first data column (so zero) and in the 3rd, 4th and 5th rows (again, Python uses zero based indices). Using `GetUnformattedValueAt` grabs the floating point number.
• The final part then uses these quantiles to calculate a robust normalized variable by using `spss.Submit` and string substitution. (And then closes the SPSS client at the end.)

``````BEGIN PROGRAM Python.
import SpssClient, spss

#start the client, grab the items in the output
SpssClient.StartClient()
OutputDoc = SpssClient.GetDesignatedOutputDoc()
OutputItemList = OutputDoc.GetOutputItems()

#Grab the last table, 0 based index
lastTab = OutputItemList.Size() - 2
OutputItem = OutputItemList.GetItemAt(lastTab)
PivotTable = OutputItem.GetSpecificType()
SpssDataCells = PivotTable.DataCellArray()

#Grab the specific quantiles
Q25 = float(SpssDataCells.GetUnformattedValueAt(2,0))
Q50 = float(SpssDataCells.GetUnformattedValueAt(3,0))
Q75 = float(SpssDataCells.GetUnformattedValueAt(4,0))
print [Q25,Q50,Q75]

#Use these stats in SPSS commands
spss.Submit("COMPUTE QuantNormX = ( COL1 - %(Q50)f )/( %(Q75)f - %(Q25)f )." % locals())
SpssClient.StopClient()
END PROGRAM.``````

While the Python code is about as complex as using OMS to grab the frequency table and merge the quantiles back into the original data, this approach will be more efficient. I can imagine using this for other projects too, like grabbing coefficients from a regression model and estimating certain marginal effects.
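Outside of SPSS, the same robust standardization can be sketched in plain Python. The percentile interpolation, function name, and data below are my own, just to show the arithmetic: center on the median and scale by the interquartile range.

```python
def robust_standardize(values):
    """Return (x - median) / IQR for each x, using linear percentile interpolation."""
    s = sorted(values)

    def percentile(p):
        # linear interpolation between the two closest ranks
        k = (len(s) - 1) * p
        f = int(k)
        c = min(f + 1, len(s) - 1)
        return s[f] + (s[c] - s[f]) * (k - f)

    q25, q50, q75 = percentile(0.25), percentile(0.50), percentile(0.75)
    iqr = q75 - q25
    return [(x - q50) / iqr for x in values]

vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(robust_standardize(vals))
```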

# Using the New York State Online Geocoding API with Python

I’ve been very lucky doing geographic analysis in New York State, as the majority of the base map layers I need, and in particular street centerline files for geocoding, are available statewide at the NYS GIS Clearinghouse. I’ve written in the past about how to use various Google APIs for geo data, and here I will show how one can use the NYS SAM address database and its ESRI online geocoding service. I explored this because Google’s terms of service are restrictive, and the NYS composite locator should (in theory) be more comprehensive and up to date in its matches.

So first, this is basically the same as most online APIs (at least in my limited experience): submit a particular URL and get JSON in return. You then just need to parse the JSON for whatever info you need. This is meant to be used within SPSS, but the function works with just a single-field address string and returns the top hit as a list of length 3: the unicode string address, and then the x and y coordinates. (The function is of course a valid Python function, so you could use this in any environment you want.) The coordinates are specified using ESRI’s WKID (see the list for projected and geographic coordinate systems). In the code I have it fixed as WKID 4326, which is WGS 1984, so it returns the longitude and latitude for the address. When the search returns no hits, it just returns a list of `[None,None,None]`.

``````*Function to use NYS geocoding API.
BEGIN PROGRAM Python.
import urllib, json

def ParsNYGeo(jBlob):
    #no candidates returns a list of Nones
    if not jBlob['candidates']:
        data = [None,None,None]
    else:
        add = jBlob['candidates'][0]['address']
        y = jBlob['candidates'][0]['location']['y']
        x = jBlob['candidates'][0]['location']['x']
        data = [add,x,y]
    return data

def NYSGeo(Add):
    #insert the SAM composite locator findAddressCandidates URL here
    base = "!!!Locator URL Here!!!"
    mid = urllib.quote_plus(Add)
    wkid = "&maxLocations=1&outSR=4326"
    end = "&f=pjson"
    MyUrl = base + mid + wkid + end
    soup = urllib.urlopen(MyUrl)
    jsonData = json.loads(soup.read())
    MyDat = ParsNYGeo(jsonData)
    return MyDat

t1 = "100 Washington Ave, Albany, NY"
t2 = "100 Washington Ave, Poop"

Out = NYSGeo(t1)
print Out

Empt = NYSGeo(t2)
print Empt
END PROGRAM.``````
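The parsing half of this can be exercised standalone with a canned JSON response shaped like the geocoder's `candidates`/`location` structure described above; the sample address and coordinates below are made up.

```python
import json

def parse_candidates(jBlob):
    """Return [address, x, y] for the top hit, or [None, None, None] if empty."""
    if not jBlob.get('candidates'):
        return [None, None, None]
    top = jBlob['candidates'][0]
    return [top['address'], top['location']['x'], top['location']['y']]

# a made-up response with one candidate, and one with no hits
hit = json.loads('{"candidates": [{"address": "100 WASHINGTON AVE", '
                 '"location": {"x": -73.76, "y": 42.65}}]}')
miss = json.loads('{"candidates": []}')

print(parse_candidates(hit))
print(parse_candidates(miss))
```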

So you can see in the code sample that you need both the street address and the city in one field. And here is a quick example with some data in SPSS. Just the zip code doesn’t return any results. There are some funny results in this test run though, and yes, that Washington Ave. extension has caused me geocoding headaches in the past.

``````*Example using with SPSS data.
DATA LIST FREE / MyAdd (A100).
BEGIN DATA
"100 Washington Ave, Albany"
"100 Washinton Ave, Albany"
"100 Washington Ave, Albany, NY 12203"
"100 Washington Ave, Albany, NY, 12203"
"100 Washington Ave, Albany, NY 12206"
"100 Washington Ave, Poop"
"12222"
END DATA.

SPSSINC TRANS RESULT=GeoAdd lon lat TYPE=100 0 0
  /FORMULA NYSGeo(Add=MyAdd).

LIST ALL.``````

# ROC and Precision-Recall curves in SPSS

Recently I was tasked with evaluating a tool used to predict violence. I initially created some code to plot ROC curves in SPSS for multiple classifiers, but then discovered that the `ROC` command did everything I wanted. Some recommend precision-recall curves in place of ROC curves, especially when the positive class is rare. This fit my situation (a few more than 100 positive cases in a dataset of half a million) and it was pretty simple to adapt the code to return the precision as well. I will not go into the details of the curves themselves (I am really a neophyte at this prediction stuff), but there are a number of good introductions available online.

The macro is named `!Roc` and it takes three parameters:

• `Class` – the numeric classifier (where higher values indicate a greater predicted probability of the positive outcome)
• `Target` – the outcome you are trying to predict. Positive cases need to equal 1 and negative cases 0
• `Suf` – the suffix on the variables returned. The procedure returns “Sens[Suf]”, “Spec[Suf]”, and “Prec[Suf]” (the sensitivity, specificity, and precision respectively).
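Under the hood, these statistics are just ratios from the confusion table at each cutoff. Here is a minimal sketch in Python of what gets computed at a single threshold; the variable names and toy data are mine.

```python
def roc_point(scores, targets, threshold):
    """Sensitivity, specificity, and precision treating score >= threshold as positive."""
    tp = sum(1 for s, t in zip(scores, targets) if s >= threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, targets) if s >= threshold and t == 0)
    fn = sum(1 for s, t in zip(scores, targets) if s < threshold and t == 1)
    tn = sum(1 for s, t in zip(scores, targets) if s < threshold and t == 0)
    sens = tp / (tp + fn) if (tp + fn) else 0.0  # recall
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    return sens, spec, prec

scores  = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
targets = [1,   1,   0,   1,   0,   0]
print(roc_point(scores, targets, 0.5))
```

Sweeping the threshold over every observed score traces out the full curve, which is what the macro does with SPSS aggregation commands.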

So here is a brief made-up example using the macro to draw ROC and precision-recall curves (the entire syntax including the macro can be found here). First let’s make some fake data and classifiers. Here `Out` is the target being predicted, and I have two classifiers, `X` and `R`. `R` is intentionally made to be basically random. The last two lines show an example of calling the macro.

``````SET SEED 10.
INPUT PROGRAM.
LOOP #i = 20 TO 70.
COMPUTE X = #i + RV.UNIFORM(-10,10).
COMPUTE R = RV.NORMAL(45,10).
COMPUTE Out = RV.BERNOULLI(#i/100).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME RocTest.
DATASET ACTIVATE RocTest.
EXECUTE.

!Roc Class = X Target = Out Suf = "_X".
!Roc Class = R Target = Out Suf = "_R".``````

Now we can make an ROC curve plot with this information. Here I use inline `TRANS` statements to calculate 1 minus the specificity. I also use a blending trick in GPL to make the beginning of the lines connect at (0,0) and the end at (1,1).

``````*Now make a plot with both classifiers on it.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Spec_X Sens_X Spec_R Sens_R
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
PAGE: begin(scale(770px,600px))
SOURCE: s=userSource(id("graphdataset"))
DATA: Spec_X=col(source(s), name("Spec_X"))
DATA: Sens_X=col(source(s), name("Sens_X"))
DATA: Spec_R=col(source(s), name("Spec_R"))
DATA: Sens_R=col(source(s), name("Sens_R"))
TRANS: o = eval(0)
TRANS: e = eval(1)
TRANS: SpecM_X = eval(1 - Spec_X)
TRANS: SpecM_R = eval(1 - Spec_R)
COORD: rect(dim(1,2), sameRatio())
GUIDE: axis(dim(1), label("1 - Specificity"), delta(0.1))
GUIDE: axis(dim(2), label("Sensitivity"), delta(0.1))
GUIDE: text.title(label("ROC Curve"))
SCALE: linear(dim(1), min(0), max(1))
SCALE: linear(dim(2), min(0), max(1))
ELEMENT: edge(position((o*o)+(e*e)), color(color.lightgrey))
ELEMENT: line(position(smooth.step.right((o*o)+(SpecM_R*Sens_R)+(e*e))), color("R"))
ELEMENT: line(position(smooth.step.right((o*o)+(SpecM_X*Sens_X)+(e*e))), color("X"))
PAGE: end()
END GPL.``````

This just replicates the native SPSS `ROC` command though, and that command returns other useful information as well (such as the actual area under the curve). We can see though that my calculations of the curve are correct.

``````*Compare to SPSS's ROC command.
ROC R X BY Out (1)
/PLOT CURVE(REFERENCE)
/PRINT SE COORDINATES.``````

To make a precision-recall graph we need to use the `path` element and sort the data in a particular way. (SPSS’s `line` element works basically the opposite of the way we need it to produce the correct sawtooth pattern.) The blending trick does not work with this graph, but it is immaterial in interpreting the graph.
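The particular sort order the `path` element needs can be sketched in Python: ascending by recall, and descending by precision within ties, so each vertical drop of the sawtooth is traced top-first. The toy points below are made up.

```python
# (recall, precision) pairs in arbitrary order
points = [(0.50, 0.70), (0.25, 0.90), (0.50, 0.65), (0.75, 0.60)]

# ascending recall; negative precision key gives descending precision within ties
ordered = sorted(points, key=lambda p: (p[0], -p[1]))
print(ordered)
```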

``````*Now make precision recall curves.
*To make these plots, need to reshape and sort correctly, so the path follows correctly.
VARSTOCASES
/MAKE Sens FROM Sens_R Sens_X
/MAKE Prec FROM Prec_R Prec_X
/MAKE Spec FROM Spec_R Spec_X
/INDEX Type.
VALUE LABELS Type
1 'R'
2 'X'.
SORT CASES BY Sens (A) Prec (D).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Sens Prec Type
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
PAGE: begin(scale(770px,600px))
SOURCE: s=userSource(id("graphdataset"))
DATA: Sens=col(source(s), name("Sens"))
DATA: Prec=col(source(s), name("Prec"))
DATA: Type=col(source(s), name("Type"), unit.category())
COORD: rect(dim(1,2), sameRatio())
GUIDE: axis(dim(1), label("Recall"), delta(0.1))
GUIDE: axis(dim(2), label("Precision"), delta(0.1))
GUIDE: text.title(label("Precision-Recall Curve"))
SCALE: linear(dim(1), min(0), max(1))
SCALE: linear(dim(2), min(0), max(1))
ELEMENT: path(position(Sens*Prec), color(Type))
PAGE: end()
END GPL.
*The sawtooth is typical.``````

These curves both show that `X` is the clear winner. In my use application the ROC curves are basically superimposed, but there is more separation in the precision-recall graph. Speaking very generally, most of the action in the ROC curve is at the leftmost area of the graph (with only a few positive cases), but the PR curve is better at identifying how wide you have to cast the net to find the few positive cases. In a nutshell, you have to be willing to live with many false positives to be able to predict just the few positive cases.

I would be interested to hear other analysts’ perspectives. Predicting violence is a popular topic in criminology, with models of varying complexity. But what I’m finding so far in this particular evaluation is basically that there is a set of low-hanging fruit of chronic offenders who score high no matter how much you crunch the numbers (around 60% of the people who committed serious violence in a particular year in my sample), and then a set of individuals with basically no prior history (around 20% in my sample). So basically ad-hoc scores are doing about as well at predicting violence as more complicated machine learning models (even machine learning models fit on the same data).

# Emailing with Python and SPSS

Emailing automated messages using Python was on my bucket list for a few projects, so here I will illustrate how to do that within SPSS. Basically the use case is that you have an automated report generated by SPSS and you want to send it to certain parties (or to yourself while you are away from work). Emailing right within Python cuts out the annoying middle task of having to email the report to yourself.

There are basically two parts to emailing within Python: 1) building the message and 2) opening your server and sending the mail. The latter is pretty simple; the former is quite tedious though. So, adapting from several posts around the internet (1,2,3 plus others I don’t remember at this point), here is a function to build the email text.

``````*function to build message.
BEGIN PROGRAM Python.
from os.path import basename
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.MIMEBase import MIMEBase
from email import Encoders
from email.utils import COMMASPACE, formatdate

def make_mail(send_from, send_to, subject, text, files=None):
    assert isinstance(send_to, list)

    msg = MIMEMultipart()
    msg['From'] = send_from
    msg['To'] = COMMASPACE.join(send_to)
    msg['Date'] = formatdate(localtime=True)
    msg['Subject'] = subject
    msg.attach(MIMEText(text))

    if files != None:
        for f in files:
            part = MIMEBase('application', 'base64')
            part.set_payload(open(f,"rb").read())
            Encoders.encode_base64(part)
            part.add_header('Content-Disposition',
                            'attachment; filename="%s"' % basename(f))
            msg.attach(part)
    return msg.as_string()
END PROGRAM.``````

The function takes the following arguments:

• `send_from`: Your email address as a string
• `send_to`: A list of email addresses to send to.
• `subject`: A text string for the subject (please use a subject when you send an email!)
• `text`: The text composition of the email in a string
• `files`: A list of files with the full directory path

Basically an email message is just a particular text format (and actually, looking at the markup, I’m slightly amazed email still functions at all). Building the markup for the to, from, subject, and text in the email is tedious but relatively straightforward. However, attaching files (the main motivation for this to begin with!) is rather a pain in the butt. Here I just encode all the files in base64, and CSV, PDF, and PNG files have all worked out so far for me in my tests. (You can attach images as binary, but this approach seems to work fine, at least for PNG images.)
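As an aside, Python 3’s newer email API takes care of the base64 encoding of attachments for you. Here is a minimal sketch of the same idea; the function name and sample attachment are mine, and the file contents are passed in as bytes so the example needs no files on disk.

```python
from email.message import EmailMessage

def make_mail3(send_from, send_to, subject, text, attachments=None):
    """Build an RFC 5322 message string; attachments is a {filename: bytes} dict."""
    msg = EmailMessage()
    msg['From'] = send_from
    msg['To'] = ', '.join(send_to)
    msg['Subject'] = subject
    msg.set_content(text)
    for name, data in (attachments or {}).items():
        # generic bytes attachment; the library base64-encodes it for us
        msg.add_attachment(data, maintype='application',
                           subtype='octet-stream', filename=name)
    return msg.as_string()

raw = make_mail3("me@example.com", ["you@example.com"], "Test",
                 "Hello!", {"Test.csv": b"a,b\n1,2\n"})
print("Subject: Test" in raw)
```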

So here is an example constructing a message, and I attach three example files. Here I just use my gmail address as both the from and to address. You can uncomment the `print MyMsg` line at the end to see the particular markup, but it is quite long with the base64-attached files.

``````*Now lets make a message.
BEGIN PROGRAM Python.

us = "apwheele"
fr = us + "@gmail.com"
to = [fr]
te = "Hello!"

MyCSV = [r"C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Email_Python\Test.csv",
r"C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Email_Python\Test.pdf",
r"C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Email_Python\OUTPUT0.PNG"]

MyMsg = make_mail(send_from=fr,send_to=to,subject="Test",text=te,files=MyCSV)
#print MyMsg
END PROGRAM.``````

The second part, opening your email server and sending the message, is relatively straightforward. Many people write their Python emailing functions with the username and password as arguments. This does not make much sense to me, as those will basically be constants for a particular user, so I find it simpler to make the message and then open the server and send it. If you want to send multiple messages it also makes more sense to open the server just once. To make the code below work for yourself, you just have to insert your own username and password (and possibly update the port number for your server).

``````*Now set up the server and send a message.
BEGIN PROGRAM Python.
import smtplib

server = smtplib.SMTP('smtp.gmail.com',587)
server.starttls()
server.login("!!Your Username Here!!","!!Your Password Here!!")

server.sendmail(fr,to, MyMsg)
server.quit()
END PROGRAM.``````

I don’t have Outlook on any of my personal machines, but hopefully it is just as simple when sending an email through a client as it is through gmail.

# String substitution in Python continued

On my prior post Jon and Jignesh both made the comment that using `locals()` in string substitution makes for easier-to-read code – a great recommendation. What `locals()` does is grab the objects from the local environment, so if in your string you place `%(MyObject)s`, and in a prior part of your code you have `MyObject = "foo"`, it will substitute `foo` into the string. Using the same prior code, here is an example:

``````BEGIN PROGRAM Python.
import spss

var = ["V1","V2","V3"]
lab = ["Var 1","Var 2","Var 3"]

for v,l in zip(var,lab):
    spss.Submit("""
*Descriptive statistics.
FREQ %(v)s.
CORRELATIONS /VARIABLES=X1 X2 X3 %(v)s.
*Graph 1.
GRAPH /SCATTERPLOT(BIVAR)=X1 WITH %(v)s /TITLE = "%(l)s".
*Graph 2.
GRAPH /SCATTERPLOT(BIVAR)=X2 WITH %(v)s /TITLE = "%(l)s".
*Graph 3.
GRAPH /SCATTERPLOT(BIVAR)=X3 WITH %(v)s /TITLE = "%(l)s".
""" % (locals()))
END PROGRAM.``````
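Outside of SPSS, the mechanics of the pattern boil down to this tiny standalone demonstration (object names are mine):

```python
# any names bound in the current scope become substitution keys
MyObject = "foo"
other = 2
s = "object is %(MyObject)s, number is %(other)d" % locals()
print(s)
```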

This ends up working in a similar fashion to the dictionary substitution I mentioned, just that Python makes the dictionary for you. Here is the same prior example using the dictionary with `%` substitution:

``````BEGIN PROGRAM Python.
import spss

var = ["V1","V2","V3"]
lab = ["Var 1","Var 2","Var 3"]

for v,l in zip(var,lab):
    MyDict = {"a":v, "b":l}
    spss.Submit("""
*Descriptive statistics.
FREQ %(a)s.
CORRELATIONS /VARIABLES=X1 X2 X3 %(a)s.
*Graph 1.
GRAPH /SCATTERPLOT(BIVAR)=X1 WITH %(a)s /TITLE = "%(b)s".
*Graph 2.
GRAPH /SCATTERPLOT(BIVAR)=X2 WITH %(a)s /TITLE = "%(b)s".
*Graph 3.
GRAPH /SCATTERPLOT(BIVAR)=X3 WITH %(a)s /TITLE = "%(b)s".
""" % (MyDict) )
END PROGRAM.``````

And finally, here is the same example using a string template, which is basically self-explanatory.

``````*Using a string template.
BEGIN PROGRAM Python.
import spss
from string import Template

#Making template
c = Template("""*Descriptive statistics.
FREQ $var.
CORRELATIONS /VARIABLES=X1 X2 X3 $var.
*Graph 1.
GRAPH /SCATTERPLOT(BIVAR)=X1 WITH $var /TITLE = "$lab".
*Graph 2.
GRAPH /SCATTERPLOT(BIVAR)=X2 WITH $var /TITLE = "$lab".
*Graph 3.
GRAPH /SCATTERPLOT(BIVAR)=X3 WITH $var /TITLE = "$lab".
""")

#now looping over variables and labels
var = ["V1","V2","V3"]
lab = ["Var 1","Var 2","Var 3"]
for v,l in zip(var,lab):
    spss.Submit(c.substitute(var=v,lab=l))
END PROGRAM.``````

The template solution is nice because it can make the code a bit more modular (i.e. you don’t have huge code blocks within a loop). The only annoyance I can see with this is that `$` does come up in SPSS code on occasion with the few system-defined variables, so in the template it needs to be escaped as `$$`.

# String substitution in Python

I recently had a set of SPSS syntax that iterated over multiple variables and generated many similar graphs. They were time series graphs of multiple variables, so the time axis stayed the same but the variable on the Y axis changed, and the code also changed the titles of the graphs. Initially I was using `%` string substitution with a (long) list of replacements. Here is a brief analogous example.

``````*Creating some fake data.
MATRIX.
SAVE {UNIFORM(100,6)} /OUTFILE = * /VARIABLES = V1 TO V3 X1 TO X3.
END MATRIX.
DATASET NAME x.

*Crazy long string substitution.
BEGIN PROGRAM Python.
import spss

#my variable and label lists
var = ["V1","V2","V3"]
lab = ["Var 1","Var 2","Var 3"]

for v,l in zip(var,lab):
    spss.Submit("""
*Descriptive statistics.
FREQ %s.
CORRELATIONS /VARIABLES=X1 X2 X3 %s.
*Graph 1.
GRAPH /SCATTERPLOT(BIVAR)=X1 WITH %s /TITLE = "%s".
*Graph 2.
GRAPH /SCATTERPLOT(BIVAR)=X2 WITH %s /TITLE = "%s".
*Graph 3.
GRAPH /SCATTERPLOT(BIVAR)=X3 WITH %s /TITLE = "%s".
""" % (v,v,v,l,v,l,v,l))
END PROGRAM.``````

When you only have to substitute one or two things, `"str %s and %s" % (one,two)` is no big deal, but here it is quite annoying to keep track of the location of all the separate variables as the list grows. Also we are really just recycling the same object to be replaced multiple times. I thought, this is Python, so there must be an easier way, and sure enough there is! A simple alternative is the `format` method of a string object. `format` can take a sequence of arguments, so the prior example would be `"str {0} and {1}".format(one,two)`. Instead of `%s`, you place brackets containing the index position of the argument (and Python has zero-based indices, so the first element is always 0).
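The index reuse is the main savings; a tiny standalone illustration (variable names are mine):

```python
# {0} appears twice but v is passed to .format() only once
v, l = "V1", "Var 1"
cmd = 'FREQ {0}. GRAPH X WITH {0} /TITLE = "{1}".'.format(v, l)
print(cmd)
```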

Here is the SPSS syntax updated to use format for string substitution.

``````*This is much simpler using ".format" for substitution.
BEGIN PROGRAM Python.
import spss

var = ["V1","V2","V3"]
lab = ["Var 1","Var 2","Var 3"]

for v,l in zip(var,lab):
    spss.Submit("""
*Descriptive statistics.
FREQ {0}.
CORRELATIONS /VARIABLES=X1 X2 X3 {0}.
*Graph 1.
GRAPH /SCATTERPLOT(BIVAR)=X1 WITH {0} /TITLE = "{1}".
*Graph 2.
GRAPH /SCATTERPLOT(BIVAR)=X2 WITH {0} /TITLE = "{1}".
*Graph 3.
GRAPH /SCATTERPLOT(BIVAR)=X3 WITH {0} /TITLE = "{1}".
""".format(v,l))
END PROGRAM.``````

Much simpler. You can use a dictionary with `%` substitution to the same effect, but here the `format` method is the simplest solution. Another option I might explore more in the future is string templates, which seem a good candidate for long strings of SPSS code.

# Continuous color ramps in SPSS

In syntax, SPSS can specify continuous color ramps for graphs. Here I will illustrate a few tricks I have found useful, as well as provide alternatives to the default rainbow color ramp SPSS uses when you don’t specify the colors yourself. First we will start with a simple set of fake data.

``````INPUT PROGRAM.
LOOP #i = 1 TO 30.
LOOP #j = 1 TO 30.
COMPUTE x = #i.
COMPUTE y = #j.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Col.
FORMATS x y (F2.0).
EXECUTE.``````

The necessary GGRAPH code to make a continuous color ramp is pretty simple, just include a scale variable and map it to a color.

``````*Colors vary by X, bar graph default rainbow.
TEMPORARY.
SELECT IF y = 1.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=x y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: x=col(source(s), name("x"), unit.category())
DATA: xC=col(source(s), name("x"))
DATA: y=col(source(s), name("y"))
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), null())
SCALE: linear(dim(2), min(0), max(1))
ELEMENT: interval(position(x*y), shape.interior(shape.square), color.interior(xC),
transparency.exterior(transparency."1"))
END GPL.
EXECUTE.``````

The `TEMPORARY` statement is just so the bars have only one value passed, and in the inline GPL I also specify that the outsides of the bars are fully transparent. The necessary code is simply creating a variable, here `xC`, that is continuous and mapping it to a color in the `ELEMENT` statement using `color.interior(xC)`. Wilkinson in the Grammar of Graphics discusses that even for continuous color ramps he prefers a discrete color legend, which is the behavior in SPSS.

The default color ramp is well known to be problematic, so I will provide some alternatives. A simple choice suitable for many situations is simply a grey-scale chart. To make this you have to make a separate `SCALE` statement in the inline GPL, and set the `aestheticMinimum` and the `aestheticMaximum`. Besides that one additional `SCALE` statement, the code is the same as before.

``````*Better color scheme based on grey-scale.
TEMPORARY.
SELECT IF y = 1.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=x y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: x=col(source(s), name("x"), unit.category())
DATA: xC=col(source(s), name("x"))
DATA: y=col(source(s), name("y"))
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), null())
SCALE: linear(dim(2), min(0), max(1))
SCALE: linear(aesthetic(aesthetic.color.interior),
aestheticMinimum(color.lightgrey), aestheticMaximum(color.black))
ELEMENT: interval(position(x*y), shape.interior(shape.square), color.interior(xC),
transparency.exterior(transparency."1"))
END GPL.
EXECUTE.``````
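Under the hood, the `aestheticMinimum`/`aestheticMaximum` pair just implies a linear interpolation between the two anchor colors. Here is a minimal sketch of that interpolation in Python; the RGB values for the light-grey and black anchors are my own assumptions.

```python
def ramp(t, lo=(211, 211, 211), hi=(0, 0, 0)):
    """Map t in [0,1] to an RGB tuple linearly interpolated between lo and hi."""
    return tuple(round(a + (b - a) * t) for a, b in zip(lo, hi))

print(ramp(0.0))   # the light-grey anchor
print(ramp(1.0))   # the black anchor
print(ramp(0.5))   # midway between them
```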

Another option I like is a black-to-red ramp (which is arguably either diverging or continuous).

``````*Grey to red.
TEMPORARY.
SELECT IF y = 1.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=x y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: x=col(source(s), name("x"), unit.category())
DATA: xC=col(source(s), name("x"))
DATA: y=col(source(s), name("y"))
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), null())
SCALE: linear(dim(2), min(0), max(1))
SCALE: linear(aesthetic(aesthetic.color.interior),
aestheticMinimum(color.black), aestheticMaximum(color.red))
ELEMENT: interval(position(x*y), shape.interior(shape.square), color.interior(xC),
transparency.exterior(transparency."1"))
END GPL.
EXECUTE.``````

Making a nice-looking interpolation from anchors like these is pretty difficult, but another one I like is green to purple. It ends up looking quite close to the associated discrete color ramp from the ColorBrewer palettes.

``````*Diverging color scale, purple to green.
TEMPORARY.
SELECT IF y = 1.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=x y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: x=col(source(s), name("x"), unit.category())
DATA: xC=col(source(s), name("x"))
DATA: y=col(source(s), name("y"))
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), null())
SCALE: linear(dim(2), min(0), max(1))
SCALE: linear(aesthetic(aesthetic.color.interior),
aestheticMinimum(color.green), aestheticMaximum(color.purple))
ELEMENT: interval(position(x*y), shape.interior(shape.square), color.interior(xC),
transparency.exterior(transparency."1"))
END GPL.
EXECUTE.``````

In cartography, whether one uses diverging or continuous ramps is typically related to the data; e.g., if the data have a natural middle point, use diverging (such as differences, with zero at the middle point). I don’t really like this advice though, as pretty much any continuous number can reasonably be turned into a diverging number (e.g. continuous rates into location quotients, splitting at the mean, residuals from a regression, whatever). So I would make the distinction like this: the ramp decides what elements you end up emphasizing. If you want to emphasize the extremes of both ends of the distribution, use a diverging ramp; if you only want to emphasize one end, use a continuous ramp. There are many situations with natural continuous numbers where we want to emphasize both ends of the ramp, based on the questions the map or graph is intended to answer.

Going along with this, you may want the middle break to not be in the middle of the actual data. To set the anchors according to an external benchmark, you can use the `min` and `max` functions within the same `SCALE` statement where you specify the colors. Here is an example with the black-to-red color ramp, but I set the minimum lower than the data, so the observed data start partway up the ramp rather than at pure black.

``````*Setting the break with the external min.
TEMPORARY.
SELECT IF y = 1.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=x y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: x=col(source(s), name("x"), unit.category())
DATA: xC=col(source(s), name("x"))
DATA: y=col(source(s), name("y"))
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), null())
SCALE: linear(dim(2), min(0), max(1))
SCALE: linear(aesthetic(aesthetic.color.interior),
aestheticMinimum(color.black), aestheticMaximum(color.red),
min(-10), max(30))
ELEMENT: interval(position(x*y), shape.interior(shape.square), color.interior(xC),
transparency.exterior(transparency."1"))
END GPL.
EXECUTE.``````

Another trick I like using often is to map discrete colors, and then use transparency to create a continuous ramp (most of the examples I use here could be replicated by specifying the color saturation as well). Here I use two colors and make points more towards the center of the graph more transparent. This can be extended to multiple color bins, see this post on Stackoverflow. Related are value-by-alpha maps, using more transparent to signify more uncertainty in the data (or to draw less attention to those areas). (That linked stackoverflow post wanted to emphasize the middle break for diverging data, but the point remains the same, make things you want to de-emphasize more transparent and things to want to emphasize more saturated.)

``````*Using transparency and fixed color bins.
RECODE x (1 THRU 15 = 1)(ELSE = 2) INTO XBin.
FORMATS XBin (F1.0).
COMPUTE Dis = -1*SQRT((x - 15)**2 + (y - 15)**2).
FORMATS Dis (F2.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=x y Dis XBin
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: x=col(source(s), name("x"))
DATA: y=col(source(s), name("y"))
DATA: Dis=col(source(s), name("Dis"))
DATA: XBin=col(source(s), name("XBin"), unit.category())
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), null())
SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color.green),("2",color.purple)))
ELEMENT: point(position(x*y), color.interior(XBin),
transparency.exterior(transparency."1"), transparency.interior(Dis))
END GPL.``````

Another powerful visualization tool to emphasize (or de-emphasize) certain points is to map the size of an element in addition to transparency. This is a great tool to add more information to scatterplots.

``````*Using redundant with size.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=x y Dis XBin
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: x=col(source(s), name("x"))
DATA: y=col(source(s), name("y"))
DATA: Dis=col(source(s), name("Dis"))
DATA: XBin=col(source(s), name("XBin"), unit.category())
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), null())
SCALE: linear(aesthetic(aesthetic.size), aestheticMinimum(size."1"),
aestheticMaximum(size."18"), reverse())
SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color.green),("2",color.purple)))
ELEMENT: point(position(x*y), color.interior(XBin),
transparency.exterior(transparency."1"), transparency.interior(Dis), size(Dis))
END GPL.``````

Finally, if you want SPSS to omit the legend (or certain aesthetics in the legend) you have to specify a `GUIDE: legend` statement for every mapped aesthetic. Here is the previous scatterplot omitting all legends.

``````*IF you want the legend omitted.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=x y Dis XBin
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: x=col(source(s), name("x"))
DATA: y=col(source(s), name("y"))
DATA: Dis=col(source(s), name("Dis"))
DATA: XBin=col(source(s), name("XBin"), unit.category())
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), null())
GUIDE: legend(aesthetic(aesthetic.color), null())
GUIDE: legend(aesthetic(aesthetic.transparency), null())
GUIDE: legend(aesthetic(aesthetic.size), null())
SCALE: linear(aesthetic(aesthetic.size), aestheticMinimum(size."1"),
aestheticMaximum(size."18"), reverse())
SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color.green),("2",color.purple)))
ELEMENT: point(position(x*y), color.interior(XBin),
transparency.exterior(transparency."1"), transparency.interior(Dis), size(Dis))
END GPL.``````

# Some examples of using DO REPEAT in SPSS

For the majority of the data management I do in SPSS, the brunt of the work is done with under 10 different commands. `DO REPEAT` is one of those, and I figured I would show some examples of its use.

In its simplest form, `DO REPEAT` iterates over a list of variables and applies computations to those variables. So say in a survey you had a skip pattern, and the variables A, B, and C should not be answered if someone has a value of 1 for Skip, but your current dataset codes those skips as system missing. We can use `DO REPEAT` to iterate over the variables and assign the missing value code.

``````DO REPEAT v = A B C.
IF Skip = 1 v = 9.
END REPEAT.``````

Note this is not a great example, as you could simply use `DO IF Skip = 1.` and nest a `RECODE` within it, but hopefully it is a clear example to start. One of the other tricks to `DO REPEAT` is that you can specify a counter variable to iterate over at the same time. So say you had a variable X that takes integer values 1 to 4. If you want to make dummy codes for, say, a regression equation, such a counter makes short work of the process. (The `VECTOR` command creates the 4 dummy variables.)

``````VECTOR X(4,F1.0).
DO REPEAT Xn = X1 TO X4 /#i = 1 TO 4.
COMPUTE Xn = (X = #i).
END REPEAT.``````
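For readers who think in Python, the same counter logic can be sketched outside of SPSS. This is purely illustrative (the function name and example values are made up), but it shows what the `DO REPEAT` with a `#i` counter is doing under the hood:

```python
def dummy_code(x, levels=(1, 2, 3, 4)):
    # Mirrors COMPUTE Xn = (X = #i): one 0/1 indicator per level
    return [int(x == i) for i in levels]

# each value of X becomes four dummy variables X1 to X4
rows = [1, 3, 4, 2]
dummies = [dummy_code(x) for x in rows]
```

Each row ends up with exactly one dummy set to 1, just as the SPSS counter version produces.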

A final trick I often find use for is to make a list of strings instead of a list of variables. For instance, say I had a list of addresses and I wanted to specifically pull out people on Main, 1st, or Central. You could do:

``````COMPUTE Flag = 0.
IF CHAR.INDEX(Street,"Main") > 0 Flag = 1.
IF CHAR.INDEX(Street,"1st") > 0 Flag = 2.
IF CHAR.INDEX(Street,"Central") > 0 Flag = 3.``````

But it is much easier to just submit your own list of strings to `DO REPEAT`:

``````COMPUTE Flag = 0.
DO REPEAT str = "Main" "1st" "Central" /#i = 1 TO 3.
IF CHAR.INDEX(Street,str) > 0 Flag = #i.
END REPEAT.``````
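A rough Python analogue of iterating over a list of strings makes the behavior explicit (the street values here are hypothetical, and the `in` operator stands in for `CHAR.INDEX(...) > 0`):

```python
def flag_street(street, patterns=("Main", "1st", "Central")):
    # Mirrors the DO REPEAT: later matches overwrite earlier ones,
    # just as the sequential IF statements do in SPSS
    flag = 0
    for i, pat in enumerate(patterns, start=1):
        if pat in street:
            flag = i
    return flag
```

Note that, as in the SPSS code, a street matching two patterns ends up with the flag of the last match.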

You can similarly submit arbitrary lists of numeric values as well. With much longer lists you can see how much more succinct this approach is. Anytime I find myself writing a series of very similar `COMPUTE` or `IF` statements, or chaining many variables together in one statement, I often rewrite the code to use `DO REPEAT`.

# Labeling tricks in SPSS plots

The other day I noticed when making the labels for treemaps that, when using a polygon element in inline GPL, SPSS places the label directly on the centroid. This is opposed to offsetting the label when using a point element. We can use this to our advantage to force labels in plots to be exactly where we want them. (Syntax to replicate all of this here.)

One popular example I have seen with state level data is to use the state abbreviation in a scatterplot instead of a point. Here is an example using the college degree and obesity data originally via Alberto Cairo.

``````FILE HANDLE save /NAME = "!!!!Your Handle Here!!!".
*Data obtained from http://vizwiz.blogspot.com/2013/01/alberto-cairo-three-steps-to-become.html
*Data was originally from VizWiz blog https://dl.dropbox.com/u/14050515/VizWiz/obesity_education.xls.
*That link does not work anymore though, see https://www.dropbox.com/s/lfwx7agkraci21y/obesity_education.xls?dl=0.
*For my own dropbox link to the data.
GET DATA /TYPE=XLS
/FILE='save\obesity_education.xls'
/SHEET=name 'Sheet1'
/CELLRANGE=full
/ASSUMEDSTRWIDTH=32767.
DATASET NAME Obs.
MATCH FILES FILE = *
/DROP V5.
FORMATS BAORHIGHER OBESITY (F2.0).

*Scatterplot with States and labels.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=BAORHIGHER OBESITY StateAbbr
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: BAORHIGHER=col(source(s), name("BAORHIGHER"))
DATA: OBESITY=col(source(s), name("OBESITY"))
DATA: StateAbbr=col(source(s), name("StateAbbr"), unit.category())
GUIDE: axis(dim(1), label("% BA or Higher"))
GUIDE: axis(dim(2), label("% Obese"))
ELEMENT: point(position(BAORHIGHER*OBESITY), label(StateAbbr))
END GPL.``````

Here you can see the state abbreviations are off-set from the points. A trick to just plot the labels, but not the points, is to draw the points fully transparent. See the `transparency.exterior(transparency."1")` in the `ELEMENT` statement.

``````GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=BAORHIGHER OBESITY StateAbbr
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: BAORHIGHER=col(source(s), name("BAORHIGHER"))
DATA: OBESITY=col(source(s), name("OBESITY"))
DATA: StateAbbr=col(source(s), name("StateAbbr"), unit.category())
GUIDE: axis(dim(1), label("% BA or Higher"))
GUIDE: axis(dim(2), label("% Obese"))
ELEMENT: point(position(BAORHIGHER*OBESITY), label(StateAbbr),
transparency.exterior(transparency."1"))
END GPL.``````

But, this does not draw the label exactly at the data observation. You can post-hoc edit the chart to specify that the label is drawn at the middle of the point element, but another option directly in code is to specify the element as a `polygon` instead of a point. By default this is basically equivalent to using a point element, since we do not pass the function an actual polygon using the `link.???` functions.

``````GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=BAORHIGHER OBESITY StateAbbr
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: BAORHIGHER=col(source(s), name("BAORHIGHER"))
DATA: OBESITY=col(source(s), name("OBESITY"))
DATA: StateAbbr=col(source(s), name("StateAbbr"), unit.category())
GUIDE: axis(dim(1), label("% BA or Higher"))
GUIDE: axis(dim(2), label("% Obese"))
ELEMENT: polygon(position(BAORHIGHER*OBESITY), label(StateAbbr), transparency.exterior(transparency."1"))
END GPL.``````

To show that this now places the labels directly on the data values, here I superimposed the data points as red dots over the labels.

``````GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=BAORHIGHER OBESITY StateAbbr
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: BAORHIGHER=col(source(s), name("BAORHIGHER"))
DATA: OBESITY=col(source(s), name("OBESITY"))
DATA: StateAbbr=col(source(s), name("StateAbbr"), unit.category())
GUIDE: axis(dim(1), label("% BA or Higher"))
GUIDE: axis(dim(2), label("% Obese"))
ELEMENT: polygon(position(BAORHIGHER*OBESITY), label(StateAbbr), transparency.exterior(transparency."1"))
ELEMENT: point(position(BAORHIGHER*OBESITY), color.interior(color.red))
END GPL.``````

Note that to get the labels like this, my chart template specifies the style of data labels as:

`````` <!-- Custom data labels for points -->
<setGenericAttributes elementName="labeling" parentName="point" count="0" styleName="textFrameStyle" color="transparent" color2="transparent"/>
``````

The first style tag specifies the font, and the second specifies the background color and outline (my default was originally a white background with a black outline). The option `styleOnly="true"` makes it so the labels are not always generated in the chart. Another side-effect of using a polygon element I should note is that SPSS draws all of the labels. When labelling point elements SPSS does intelligent labelling, and does not label all of the points if many overlap (and tries to place the labels at non-overlapping locations). It is great for typical scatterplots, but here I do not want that behavior.

Other examples where I think this would be useful are simple dot plots in which the categories have meaningful labels. Here is an example using a legend, where one has to consult the legend to understand the graph. I like making the points semi-transparent, so when they overlap you can still see the different points (see here for an example with data that actually overlap).

For such a simple graph, though, we can plot the category labels directly.

The direct labelling will not work out so well if many of the points overlap, but jittering or dodging can be used then as well. (You will have to jitter or dodge the data yourself if using a polygon element in inline GPL. If you want to use a point element again you can post hoc edit the chart so that the labels are at the middle of the point.)
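If you do jitter the data yourself before plotting, it only takes adding a small random offset to each coordinate. A minimal sketch in Python (the width of 0.1 and the seed are arbitrary choices for illustration):

```python
import random

def jitter(vals, width=0.1, seed=2):
    # add uniform noise in [-width, width] to each value
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in vals]

# overlapping category codes spread out slightly for plotting
ys = jitter([1, 1, 2, 2, 3])
```

Fixing the seed keeps the jittered positions reproducible between runs.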

Another example is that I like to place labels in line graphs at the end of the line. Here I show an example of doing that by making a separate label for only the final value of the time series, offsetting to the right slightly, and then placing the invisible polygon element in the chart.

``````SET SEED 10.
INPUT PROGRAM.
LOOP #M = 1 TO 5.
LOOP T = 1 TO 10.
COMPUTE ID = #M.
COMPUTE Y = RV.NORMAL(#M*5,1).
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME TimeSeries.
FORMATS T ID Y (F3.0).
ALTER TYPE ID (A1).

STRING Lab (A1).
IF T = 10 Lab = ID.
EXECUTE.

*Labelled at end of line.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=T Y ID Lab MISSING=VARIABLEWISE
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: T=col(source(s), name("T"))
DATA: Y=col(source(s), name("Y"))
DATA: ID=col(source(s), name("ID"), unit.category())
DATA: Lab=col(source(s), name("Lab"), unit.category())
TRANS: P=eval(T + 0.2)
GUIDE: axis(dim(1), label("T"))
GUIDE: axis(dim(2), label("Y"))
SCALE: linear(dim(1), min(1), max(10.2))
ELEMENT: line(position(T*Y), split(ID))
ELEMENT: polygon(position(P*Y), label(Lab), transparency.exterior(transparency."1"))
END GPL.``````
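The data step above — keep a label only for the final observation of each line, and nudge its position to the right — is easy to see in a small Python sketch (the series values and the 0.2 offset are just illustrative):

```python
# time points run 1..len(y); the label goes at (last T + 0.2, last Y)
series = {"A": [5.1, 4.9, 5.3], "B": [10.2, 9.8, 10.1]}
label_pos = {sid: (len(y) + 0.2, y[-1]) for sid, y in series.items()}
```

This is exactly what the `IF T = 10 Lab = ID.` and `TRANS: P=eval(T + 0.2)` statements accomplish in the SPSS code.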

This works the same for point elements too, but this forces the label to be drawn and they are drawn at the exact location. See the second image in my peremptory challenge post for example behavior when labelling with the point element.

You can also use this to make random annotations in the graph. Code omitted here (it is available in the syntax at the beginning), but here is an example:

If you want you can then style this label separately when post-hoc editing the chart. Here is a silly example. (The workaround of using annotations like this suggests a change to the grammar: allowing `GUIDE` to have a special type just for a label, where you supply the x and y position.)

This direct labelling further blurs the lines between tables and graphs, what Tukey called semi-graphic. I recently came across the portmanteau grables (graphs and tables) to describe such labelling in graphs as well.

# Translating between the dispersion term in a negative binomial regression and random variables in SPSS

NOTE!! – when I initially posted this I was incorrect; I thought SPSS listed the dispersion term in the form of `Var(x) = mean + mean*dispersion`. But I was wrong, and it is `Var(x) = 1 + mean*dispersion` (the same as Stata's, what Cameron and Trivedi call the NB2 model, as cited in the Long and Freese Stata book for categorical variables). The simulation in the original post worked out because in my example I used a mean of 1; here I update it to a mean of 2 to show the calculations are correct. (Also note that this parametrization is equivalent to `Var(x) = mean*(1 + mean*dispersion)`, see Stata's help for nbreg.)

When estimating a negative binomial regression equation in SPSS, it returns the dispersion parameter in the form of:

``Var(x) = 1 + mean*dispersion``

When generating random variables from the negative binomial distribution, SPSS does not take the parameters in this form, but the more usual N trials with probability of success P. Stealing a bit from the R documentation for `dnbinom`, I was able to translate between the two with just a tedious set of algebra. So with our original distribution being:

``````Mean = mu
Variance = 1 + mu*a``````

R has an alternative representation closer to SPSS’s based on:

``````Mean = mu
Variance = mu + mu^2/x``````

Some tedious algebra will reveal that in this notation `x = mu^2/(1 - mu + a*mu)` (note to future self, using Solve in Wolfram Alpha could have saved some time, paper, and ink). Also, R's help for `dnbinom` states that in the original N and P notation `p = x/(x + mu)`. So here with `mu` and `a` (again `a` is the dispersion term as reported by GENLIN in SPSS) we can solve for `p`.

``````x = mu^2/(1 - mu + a*mu)
p = x/(x + mu)``````

And since `p` is solved, R lists the mean of the distribution in the N and P notation as:

``n*(1-p)/p = mu``

So with `p` solved we can figure out N as equal to:

``mu*p/(1-p) = n``

So to reiterate, if you have a mean of 2 and dispersion parameter of 4, the resultant N and P notation would be:

``````mu = 2
a = 4
x = mu^2/(1 - mu + a*mu) = 2^2/(1 - 2 + 4*2) = 4/7
p = x/(x + mu) = (4/7)/(4/7 + 2) = 2/9
n = mu*p/(1-p) = 2*(2/9)/(7/9) = 4/7``````

Here we can see that in the N and P notation the same negative binomial model results in a fractional number of successes, which may come as a surprise that it is even a possibility. (There is likely an easier way to do this translation, but forgive me, I am not a mathematician!)
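Since this algebra is easy to fumble, here is a quick check of the worked example using exact rational arithmetic in plain Python (not SPSS; `nb_translate` is a name I made up for illustration):

```python
from fractions import Fraction

def nb_translate(mu, a):
    # x = mu^2/(1 - mu + a*mu), p = x/(x + mu), n = mu*p/(1 - p)
    mu, a = Fraction(mu), Fraction(a)
    x = mu**2 / (1 - mu + a*mu)
    p = x / (x + mu)
    n = mu * p / (1 - p)
    return x, p, n

x, p, n = nb_translate(2, 4)  # mu = 2, dispersion a = 4
# recover the mean: n*(1-p)/p should give mu back
mean_check = n * (1 - p) / p
```

Working in fractions reproduces the 4/7 and 2/9 values exactly, and recovering the mean from the N and P notation confirms the round trip.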

Now we would be finished, but unfortunately SPSS's negative binomial random functions only take integer values and do not take values of N less than 1 (R's `rnbinom` does). So we have to do another translation, from the N and P notation to the gamma distribution, to be able to draw random numbers in SPSS. Another representation of the negative binomial model is as a mixture of Poisson distributions, with the distribution of the Poisson means coming from a gamma distribution. Wikipedia lists a translation from the N and P notation to a gamma with shape = N and scale = P/(1-P).

So I wrapped these computations up in an SPSS macro that takes the mean and the dispersion parameter, calculates N and P under the hood, and then draws a random variable from the associated negative binomial distribution.

``````DEFINE !NegBinRV (mu = !TOKENS(1)
/disp = !TOKENS(1)
/out = !TOKENS(1) )
COMPUTE #x = !mu**2/(1 - !mu + !disp*!mu).
COMPUTE #p = #x / (#x + !mu).
COMPUTE #n = !mu*#p/(1 - #p).
COMPUTE #G = RV.GAMMA(#n,#p/(1 - #p)).
COMPUTE !Out = RV.POISSON(#G).
FORMATS !Out (F5.0).
!ENDDEFINE.``````
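The macro's logic can be mirrored in plain Python as a sanity check. One caveat in translation: Python's `random.gammavariate(shape, scale)` has mean shape*scale, so to get a gamma mean of `n*(1-p)/p = mu` the scale below is `(1-p)/p` — whereas the macro passes `#p/(1-#p)` to SPSS's `RV.GAMMA`, whose second argument behaves as a rate (if I have SPSS's parametrization right). This is an illustrative sketch, not SPSS code:

```python
import math
import random

def rnegbin(mu, a, rng):
    # translate (mu, a) to the N and P notation as in the text
    x = mu**2 / (1 - mu + a*mu)
    p = x / (x + mu)
    n = mu * p / (1 - p)
    # gamma-Poisson mixture: draw lambda ~ Gamma(shape=n), then Poisson(lambda);
    # gammavariate's scale is (1-p)/p so that E[lambda] = n*(1-p)/p = mu
    lam = rng.gammavariate(n, (1 - p) / p)
    # simple multiplicative Poisson sampler (fine for modest lambda)
    L, k, prod = math.exp(-lam), 0, rng.random()
    while prod > L:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(10)
draws = [rnegbin(2, 4, rng) for _ in range(20000)]
```

With a mean of 2 and dispersion of 4, the sample mean should land near 2 and the sample variance near 9, matching the `Var(x)` formula above.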

I am not sure if it is possible to use this gamma representation and native SPSS functions to calculate the corresponding CDF and PDF of the negative binomial distribution. But we can use R to do that. Here is an example of keeping the mean at 1 and varying the dispersion parameter between 0 and 5.

``````BEGIN PROGRAM R.
library(ggplot2)
x <- expand.grid(0:10,1:5)
names(x) <- c("Int","Disp")
mu <- 1
x$PDF <- mapply(dnbinom, x=x$Int, size=mu^2/(1 - mu + x$Disp*mu), mu=mu)
t <- data.frame(cbind(0:10,rep(0,11),dpois(0:10,lambda=1)))
names(t) <- c("Int","Disp","PDF")
x <- rbind(t,x)
p <- ggplot(data = x, aes(x = Int, y = PDF, group = as.factor(Disp))) + geom_line()
p
#for the CDF
x$CDF <- ave(x$PDF, x$Disp, FUN = cumsum)
END PROGRAM.``````

Here you can see how the larger dispersion term can easily approximate the zero inflation typical in criminal justice data (see an applied example from my work). R will not take a dispersion parameter of zero in this notation (as the size would be divided by zero and not defined), so I just tacked on the Poisson distribution with a mean of zero.
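If you want the pmf without going through R, it can also be computed directly from the gamma-function form of the negative binomial. This sketch uses the same size/prob convention as `dnbinom` (it is not something SPSS provides natively), working on the log scale so fractional sizes and large counts are no problem:

```python
import math

def nbinom_pmf(k, n, p):
    # P(X=k) = Gamma(k+n)/(Gamma(n) k!) * p^n * (1-p)^k, computed via
    # lgamma to avoid overflow for large k and to allow fractional n
    log_pmf = (math.lgamma(k + n) - math.lgamma(n) - math.lgamma(k + 1)
               + n * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

# mu = 2 and dispersion 4 translate to n = 4/7 and p = 2/9 per the text
n, p = 4/7, 2/9
probs = [nbinom_pmf(k, n, p) for k in range(400)]
```

Summing the pmf over a long enough range should give 1, and the weighted sum of the counts should recover the mean of 2.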

Here is an example of generating random data from a negative binomial distribution with a mean of 2 and a dispersion parameter of 4. I then grab the PDF from R, and superimpose them both on a chart in SPSS (or perhaps I should call it a PMF, since it only has support on integer values). You can see the simulation with 10,000 observations is a near perfect fit (so a good sign I did not make any mistakes!)

``````*Simulation In SPSS.
INPUT PROGRAM.
LOOP Id = 1 TO 10000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME RandNB.

!NegBinRV mu = 2 disp = 4 out = NB.

*Making separate R dataset of PDF.
BEGIN PROGRAM R.
mu <- 2
disp <- 4
x <- 0:11
pdf <- dnbinom(x=x,size=mu^2/(1 - mu + disp*mu),mu=mu)
pdf[max(x)+1] <- 1 - sum(pdf[-(max(x)+1)])
MyDf <- data.frame(cbind(x,pdf))
END PROGRAM.
EXECUTE.
STATS GET R FILE=* /GET DATAFRAME=MyDf DATASET=PDF_NB.
DATASET ACTIVATE PDF_NB.
FORMATS x (F2.0).
VALUE LABELS x 11 '11 or More'.

*Now superimposing bar plot and PDF from separate datasets.
DATASET ACTIVATE RandNB.
RECODE NB (11 THRU HIGHEST = 11)(ELSE = COPY) INTO NB_Cat.
FORMATS NB_Cat (F2.0).
VALUE LABELS NB_Cat 11 '11 or More'.

GGRAPH
/GRAPHDATASET NAME="Data" DATASET='RandNB' VARIABLES=NB_Cat[LEVEL=ORDINAL] COUNT()[name="COUNT"]
/GRAPHDATASET NAME="PDF" DATASET='PDF_NB' VARIABLES=x pdf
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: Data=userSource(id("Data"))
DATA: NB_Cat=col(source(Data), name("NB_Cat"), unit.category())
DATA: COUNT=col(source(Data), name("COUNT"))
SOURCE: PDF=userSource(id("PDF"))
DATA: x=col(source(PDF), name("x"), unit.category())
DATA: den=col(source(PDF), name("pdf"))
TRANS: den_per = eval(den*100)
GUIDE: axis(dim(1))
GUIDE: axis(dim(2))
SCALE: linear(dim(2), include(0))
ELEMENT: interval(position(summary.percent(NB_Cat*COUNT)), shape.interior(shape.square))
ELEMENT: point(position(x*den_per), color.interior(color.black), size(size."8"))
END GPL.``````